Clusters and features from combinatorial 
stochastic processes 

Tamara Broderick, Michael I. Jordan, Jim Pitman 
March 8, 2013 

Abstract 

In partitioning — a.k.a. clustering — data, we associate each data point 
with one and only one of some collection of groups called clusters or par- 
tition blocks. Here, we formally develop an analogous problem, called 
feature allocation, for associating data points with arbitrary non-negative 
integer numbers of groups, now called features or topics. We review known 
combinatorial stochastic process representations of clustering and develop 
analogous representations for the feature allocation case. We illustrate 
the clustering representations with examples that include the canonical 
nonparametric Bayesian clustering prior: the Chinese restaurant process 
or Dirichlet process. We not only illustrate the feature allocation repre- 
sentations with the canonical nonparametric Bayesian feature prior — the 
Indian buffet process or beta process — but also simultaneously discover 
new connections between the different representations for the Indian buffet 
process. We thereby bring the same level of completeness to the treatment 
of the Indian buffet that has previously been developed for the Chinese 
restaurant. 

1 Introduction 

Bayesian nonparametrics is the area of Bayesian analysis in which the finite-
dimensional prior distributions of classical Bayesian analysis are replaced with
stochastic processes. While the rationale for allowing infinite collections of ran- 
dom variables into Bayesian inference is often taken to be that of diminishing 
the role of prior assumptions, it is also possible to view the move to nonparamet- 
rics as supplying the Bayesian paradigm with a richer collection of distributions 
with which to express prior belief, thus in some sense emphasizing the role of 
the prior. In practice, however, the field has been dominated by two stochas- 
tic processes — the Gaussian process and the Dirichlet process — and thus the 
flexibility promised by the nonparametric approach has arguably not yet been 
delivered. In the current paper we aim to provide a broader perspective on the 
kinds of stochastic processes that can provide a useful toolbox for Bayesian
nonparametric analysis. Specifically, we focus on combinatorial stochastic processes
as embodying mathematical structure that is useful for both model specification
and inference.

The phrase "combinatorial stochastic process" comes from probability theory
[Pitman, 2006], where it refers to connections between stochastic processes
and the mathematical field of combinatorics. Indeed, the focus in this area of 
probability theory is on random versions of classical combinatorial objects such 
as partitions, trees, and graphs — and on the role of combinatorial analysis in 
establishing properties of these processes. As we wish to argue, this connection 
is also fruitful in a statistical setting. Roughly speaking, in statistics it is often 
natural to model observed data as arising from a combination of underlying 
factors. In the Bayesian setting, such models are often embodied as latent vari- 
able models in which the latent variable has a compositional structure. Making 
explicit use of ideas from combinatorics in latent variable modeling can not 
only suggest new modeling ideas, but it can also provide essential help with 
calculations of marginal and conditional probability distributions. 

The Dirichlet process already serves as one interesting exhibit of the
connections between Bayesian nonparametrics and combinatorial stochastic processes.
On the one hand, the Dirichlet process is classically defined in terms of a
partition of a probability space [Ferguson, 1973], and there are many well-known
connections between the Dirichlet process and urn models [Blackwell and MacQueen,
1973, Hoppe, 1984]. In the current paper, we will review and expand upon
some of these connections, beginning our treatment (non-traditionally) with
the notion of an exchangeable partition probability function (EPPF) and, from 
there, discussing related urn models, stick-breaking representations, subordina- 
tors, and random measures. 

On the other hand, the Dirichlet process is limited in terms of the statistical 
notion of "combination of underlying factors" that we referred to above. Indeed, 
the Dirichlet process is generally used in a statistical setting to express the idea 
that each data point is associated with one and only one underlying factor. In 
contrast to such clustering models, we wish to also consider featural models, 
where each data point is associated with a set of underlying features and it is 
the interaction among these features that gives rise to an observed data point. 
Focusing on the case in which these features are binary, we develop some of the 
combinatorial stochastic process machinery needed to specify featural priors. 
Specifically, we develop a counterpart to the EPPF, which we refer to as the 
exchangeable feature probability function (EFPF), that characterizes the com- 
binatorial structure of certain featural models. We again develop connections 
between this combinatorial function and a suite of related stochastic processes,
including urn models, stick-breaking representations, subordinators, and random
measures. As we will discuss, a particular underlying random measure in this
case is the beta process, originally studied by Hjort [1990] as a model of random
hazard functions in survival analysis, but adapted by Thibaux and Jordan [2007]
for applications in featural modeling.



For statistical applications, it is not enough to develop expressive prior
specifications; it is also essential that inferential computations involving the
posterior distribution are tractable. One of the reasons for the popularity of the
Dirichlet process is that the associated urn models and stick-breaking
representations yield a variety of useful inference algorithms [Neal, 2000]. As we
will see, analogous algorithms are available for featural models. Thus, as we discuss
each of the various representations associated with both the Dirichlet process 
and the beta process, we will also (briefly) discuss some of the consequences of 
each for posterior inference. 

The remainder of the paper is organized as follows. We start by reviewing
partitions and introducing feature allocations in Section 2 in order to define
distributions over these models (Section 3) via the exchangeable partition
probability function (EPPF) in the partition case (Section 3.1) and the exchangeable
feature probability function (EFPF) in the feature allocation case (Section 3.2).
Illustrating these exchangeable probability functions with examples, we will see
that the well-known Chinese restaurant process (CRP) [Blackwell and MacQueen,
1973, Aldous, 1985] corresponds to a particular EPPF choice (Example 1) and
the Indian buffet process (IBP) [Griffiths and Ghahramani, 2006] corresponds
to a particular choice of EFPF (Example 5). From here, we progressively build
up richer models by first reviewing stick lengths (Section 4), which we will see
represent limiting frequencies of certain clusters or features, and then
subordinators (Section 5), which further associate a random label with each cluster
or feature. We illustrate these progressive augmentations on a running series of
CRP examples (starting with Examples 1 and 6) and IBP examples (starting with
Example 5). We augment the model once more to obtain a random measure on a
general space of cluster or feature parameters in Section 6; here, among other
relations, we find that the CRP example is the marginal of a Dirichlet process
(Example 23), and the IBP example is the marginal of a beta process (Example 24).
Finally, in Section 7 we mention some of the other combinatorial stochastic
processes, beyond the Dirichlet process and the beta process, that have begun to be
studied in the Bayesian nonparametrics literature, and we provide suggestions for
further developments.



2 Partitions and feature allocations 

While we have some intuitive ideas about what constitutes a mixture or ad- 
mixture model, we want to formalize these ideas before proceeding. We begin 
with the underlying combinatorial structure on the data indices. We think of 
[N] := {1, . . . , N} as representing the indices of the first N data points. There 
are different groupings that we apply in the mixture case (partitions) and
admixture case (feature allocations); we describe these below.

First, we wish to describe the space of partitions over the indices $[N]$. In
particular, a partition $\pi_N$ of $[N]$ is defined to be a collection of mutually
exclusive, mutually exhaustive, non-empty subsets of $[N]$ called blocks; that
is, $\pi_N = \{A_1, \ldots, A_K\}$ for some number of partition blocks $K$. An
example partition of $[6]$ is $\pi_6 = \{\{1, 3, 4\}, \{2\}, \{5, 6\}\}$. Similarly,
a partition of $\mathbb{N} = \{1, 2, \ldots\}$ is a collection of mutually exclusive,
mutually exhaustive, non-empty subsets of $\mathbb{N}$. In this case, the number of
blocks may be infinite, and we have $\pi_{\mathbb{N}} = \{A_1, A_2, \ldots\}$. An
example partition of $\mathbb{N}$ into two blocks is
$\{\{n : n \text{ is even}\}, \{n : n \text{ is odd}\}\}$.

We introduce a generalization of a partition called a feature allocation that
relaxes both the mutually exclusive and mutually exhaustive restrictions. In
particular, a feature allocation $f_N$ of $[N]$ is defined to be a multiset of
non-empty subsets of $[N]$, again called blocks, such that no index $n$ belongs to
infinitely many blocks. We write $f_N = \{A_1, \ldots, A_K\}$ for some number of
feature allocation blocks $K$. An example feature allocation of $[6]$ is
$f_6 = \{\{2, 3\}, \{2, 4, 6\}, \{3\}, \{3\}, \{3\}\}$. Just as the blocks of a
partition are sometimes called clusters, so are the blocks of a feature allocation
sometimes called features. We note that a partition is always a feature allocation,
but the converse statement does not hold in general; for instance, $f_6$ given
above is not a partition.
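As a small illustration of these two definitions, the following Python sketch checks whether a finite collection of blocks is a feature allocation or a partition of $[N]$. The function names and the representation of a multiset as a list of sets are our own assumptions for this sketch, not part of the formal development.

```python
from collections import Counter


def is_feature_allocation(blocks, n):
    """True if every block is a non-empty subset of {1, ..., n}."""
    return all(len(b) > 0 and all(1 <= i <= n for i in b) for b in blocks)


def is_partition(blocks, n):
    """True if the blocks are non-empty and cover each index in [n] exactly once."""
    if not is_feature_allocation(blocks, n):
        return False
    counts = Counter(i for b in blocks for i in b)
    return all(counts[i] == 1 for i in range(1, n + 1))


# A multiset of blocks is stored as a list, so repeated blocks are allowed.
pi_6 = [{1, 3, 4}, {2}, {5, 6}]
f_6 = [{2, 3}, {2, 4, 6}, {3}, {3}, {3}]

print(is_partition(pi_6, 6), is_feature_allocation(pi_6, 6))  # True True
print(is_partition(f_6, 6), is_feature_allocation(f_6, 6))    # False True
```

Because the list is finite, the requirement that no index belongs to infinitely many blocks holds automatically here.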

In the remainder of this section, we continue our development in terms of
feature allocations since partitions are a special case of feature allocations. We
note that we can extend the idea of random partitions [Aldous, 1985] to consider
random feature allocations. If $\mathcal{F}_N$ is the space of all feature
allocations of $[N]$, then a random feature allocation $F_N$ of $[N]$ is a random
element of this space.

We next introduce a few useful assumptions on our random feature alloca- 
tion. Just as exchangeability of observations is often a central assumption in 
statistical modeling, so will we make use of exchangeable feature allocations. To 
rigorously define such feature allocations, we introduce the following notation. 
Let $\sigma : \mathbb{N} \to \mathbb{N}$ be a finite permutation. That is, for some
finite value $N_\sigma$, we have $\sigma(n) = n$ for all $n > N_\sigma$. Further, for
any block $A \subset \mathbb{N}$, denote the permutation applied to the block as
follows: $\sigma(A) := \{\sigma(n) : n \in A\}$. For any feature allocation $F_N$,
denote the permutation applied to the feature allocation as follows:
$\sigma(F_N) := \{\sigma(A) : A \in F_N\}$. Finally, let $F_N$ be a random feature
allocation of $[N]$. Then we say that $F_N$ is exchangeable if
$F_N \stackrel{d}{=} \sigma(F_N)$ for every finite permutation $\sigma$.

Our second assumption in what follows will be that we are dealing with a
consistent feature allocation. We often implicitly imagine the indices arriving
one at a time: first 1, then 2, up to $N$ or beyond. We will find it useful, similarly,
in defining random feature allocations to suppose that the randomness at stage
$n$ somehow agrees with the randomness at stage $n + 1$. More formally, we say
that a feature allocation $f_M$ of $[M]$ is a restriction of a feature allocation
$f_N$ of $[N]$ for $M < N$ if

$$f_M = \{A \cap [M] : A \in f_N\}.$$

Let $\mathcal{R}_N(f_M)$ be the set of all feature allocations of $[N]$ whose
restriction to $[M]$ is $f_M$. Then we say that the sequence of random feature
allocations $(F_N)$ is consistent if (1) there is some function $q$ such that
$q(f_N)$ is the probability that $F_N$ takes the value $f_N$ for each $N$ and
(2) moreover, for all $M$ and $N$ such that $M < N$, we have

$$q(f_M) = \sum_{f_N \in \mathcal{R}_N(f_M)} q(f_N). \qquad (1)$$

With this consistency condition in hand, we can define a random feature
allocation $F_\infty$ of $\mathbb{N}$. In particular, such a feature allocation is
characterized by the sequence of consistent finite restrictions $F_N$ to $[N]$:
$F_N = \{A \cap [N] : A \in F_\infty\}$. Then $F_\infty$ is equivalent to a consistent
sequence of finite feature allocations and may be thought of as a random element of
the space of such sequences: $F_\infty = (F_N)_N$. We let $\mathcal{F}_\infty$ denote
the space of consistent feature allocations, of which each random feature allocation
is a random element, and we see that the sigma-algebra associated with this space is
generated by the finite-dimensional sigma-algebras of the restricted random feature
allocations $F_N$.

We say that $F_\infty$ is exchangeable if $F_\infty \stackrel{d}{=} \sigma(F_\infty)$
for every finite permutation $\sigma$. That is, when the permutation $\sigma$ changes
no indices above $N$, we require $F_N \stackrel{d}{=} \sigma(F_N)$, where $F_N$ is
the restriction of $F_\infty$ to $[N]$.
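The restriction and permutation operations used in these definitions can be sketched in Python as follows. The helper names are hypothetical, and dropping blocks whose intersection with $[M]$ is empty is our reading of the requirement that blocks be non-empty, not something stated explicitly above.

```python
def restrict(blocks, m):
    """Intersect every block with [m]; empty intersections are dropped (assumption)."""
    restricted = [frozenset(i for i in block if i <= m) for block in blocks]
    return [block for block in restricted if block]


def permute(blocks, sigma):
    """Apply a finite permutation, given as a dict on the indices it moves."""
    return [frozenset(sigma.get(i, i) for i in block) for block in blocks]


def canonical(blocks):
    """Order-free representation, so two multisets of blocks can be compared."""
    return sorted(tuple(sorted(block)) for block in blocks)


f_6 = [{2, 3}, {2, 4, 6}, {3}, {3}, {3}]
f_3 = restrict(f_6, 3)
swap_2_3 = {2: 3, 3: 2}

print(canonical(f_3))                     # [(2,), (2, 3), (3,), (3,), (3,)]
print(canonical(permute(f_6, swap_2_3)))  # [(2,), (2,), (2,), (2, 3), (3, 4, 6)]
```

Restricting in two steps (first to $[4]$, then to $[3]$) gives the same result as restricting directly to $[3]$, which is the deterministic counterpart of the consistency requirement.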

In what follows, we consider exchangeable, consistent random partitions and 
feature allocations. 



3 Exchangeable probability functions 

Once we know that we can construct (exchangeable and consistent) random 
partitions and feature allocations, it remains to find useful ways of representing 
the distributions $q$, as in Eq. (1), over these objects.

3.1 Exchangeable partition probability function 

Consider first an exchangeable, consistent, random partition $(\Pi_N)$. From
Eq. (1), we have a function $q$ describing the distribution of the partition. By the
exchangeability assumption, this distribution should depend only on the
(unordered) sizes of the blocks. Therefore, there is further a function $p$ that is
symmetric in its arguments such that, for any specific partition assignment
$\pi_N = \{A_1, \ldots, A_K\}$, we have

$$\mathbb{P}(\Pi_N = \pi_N) = p(|A_1|, \ldots, |A_K|).$$

The function $p$ is called the exchangeable partition probability function (EPPF)
[Pitman, 1995].



Example 1 (Chinese restaurant process). The Chinese restaurant process (CRP)
[Blackwell and MacQueen, 1973] is an iterative description of a partition via the
conditional distribution of increasing partition index labels. The Chinese
restaurant metaphor forms an equivalence between customers entering a Chinese
restaurant and partition indices; customers who share a table at the restaurant
represent indices belonging to the same partition block.

To generate the first index label, the first customer enters the restaurant
and sits down at some table, necessarily unoccupied since no one else is in the
restaurant. A "dish" is set out at the new table; call the dish "1" since it
is the first dish. The customer is assigned the label of the dish at her table:
$Z_1 = 1$. Recursively, for a restaurant with concentration parameter $\theta$, the
$n$th customer sits at an occupied table with probability in proportion to the
number of people at the table and at a new table with probability proportional to
$\theta$. In the former case, $Z_n$ takes the value of the existing dish at the
table, and in the latter case, the next available dish $k$ (equal to the number of
existing tables plus one) appears at the new table, and $Z_n = k$. By summing over
all possibilities when the $n$th customer arrives, one obtains the normalizing
constant for the distribution across potential occupied tables:
$(n - 1 + \theta)^{-1}$. To summarize, if we let $K_n := \max\{Z_1, \ldots, Z_n\}$,
then the distribution of table assignments for the $n$th customer is

$$\mathbb{P}(Z_n = k \mid Z_1, \ldots, Z_{n-1}) = \frac{1}{n - 1 + \theta}
\begin{cases}
\#\{m : m < n, Z_m = k\} & \text{for } k \le K_{n-1} \\
\theta & \text{for } k = K_{n-1} + 1.
\end{cases} \qquad (2)$$

Figure 1: The diagram represents a possible CRP seating arrangement after 11
customers have entered a restaurant with parameter $\theta$. Each large white circle
is a table, and the smaller gray circles are customers sitting at those tables.
If a 12th customer enters, the expressions in the middle of each table give the
probability of the new customer sitting there. In particular, the probability of
the 12th customer sitting at the first table is $5/(11 + \theta)$, and the
probability of the 12th customer forming a new table is $\theta/(11 + \theta)$.

We note that an equivalent generative description follows a Polya urn style
in specifying that each incoming customer sits next to an existing customer with
probability proportional to 1 and forms a new table with probability
proportional to $\theta$ [Hoppe, 1984].
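The seating rule above translates directly into a short simulation. The following Python sketch is only illustrative; the function name `sample_crp` and the use of the standard-library `random.Random.choices` for the weighted draw are our own choices.

```python
import random


def sample_crp(n, theta, rng):
    """Sample table labels Z_1, ..., Z_n, numbering tables in order of appearance."""
    labels = []
    counts = []  # counts[k - 1] = number of customers currently at table k
    for _ in range(n):
        # Existing tables are weighted by occupancy; a new table by theta.
        weights = counts + [theta]
        k = rng.choices(range(1, len(weights) + 1), weights=weights)[0]
        if k == len(counts) + 1:
            counts.append(1)
        else:
            counts[k - 1] += 1
        labels.append(k)
    return labels


labels = sample_crp(11, theta=2.0, rng=random.Random(0))
print(labels)
```

The first customer always receives label 1, since the only available weight at that point is the one for a new table.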



Next, we find the probability of the partition induced by considering the
collection of indices sitting at each table as a block in the partition. Suppose that
the set of cardinalities of non-zero table occupancies is $\{N_1, \ldots, N_K\}$
with $N := \sum_{k=1}^{K} N_k$. That is, we are considering the case when $N$
customers have entered the restaurant and sat at $K$ different tables in the
specified configuration. We can see from Eq. (2) that when the $n$th customer enters
($n > 1$), we obtain a factor of $n - 1 + \theta$ in the denominator. Using the
following notation for the rising and falling factorial

$$x_{M \uparrow a} := \prod_{m=0}^{M-1} (x + ma), \qquad
x_{M \downarrow a} := \prod_{m=0}^{M-1} (x - ma),$$

we find that the denominator must be $(\theta + 1)_{N-1 \uparrow 1}$. Similarly, each
time a customer forms a new table except for the first table, we obtain a factor of
$\theta$ in the numerator. Combining these factors, we find a factor of
$\theta^{K-1}$ in the numerator. Finally, each time a customer sits at an existing
table with $n$ occupants, we obtain a factor of $n$ in the numerator. Thus, for each
table $k$, we have a factor of $(N_k - 1)!$ once all customers have entered the
restaurant. Having collected all terms in the process, we see that the probability
of the resulting configuration is

$$\mathbb{P}(\Pi_N = \pi_N) =
\frac{\theta^{K-1} \prod_{k=1}^{K} (N_k - 1)!}{(\theta + 1)_{N-1 \uparrow 1}}. \qquad (3)$$

We first note that Eq. (3) depends only on the block sizes and not on the order of
arrival of the customers or dishes at the tables. We conclude that the partition
generated according to the CRP scheme is exchangeable. Moreover, as the
partition $\Pi_M$ is the restriction of $\Pi_N$ to $[M]$ for any $N > M$ by
construction, we have that Eq. (3) satisfies the consistency condition of Eq. (1).
It follows that Eq. (3) is, in fact, an EPPF. ■
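As a sanity check on this derivation, the following Python sketch compares the closed-form CRP partition probability with the product of sequential seating probabilities, using exact rational arithmetic, and confirms that the probabilities of all partitions of $[4]$ sum to one. The function names are hypothetical, and the choice $\theta = 3/2$ is arbitrary.

```python
from collections import Counter
from fractions import Fraction
from math import factorial


def crp_eppf(block_sizes, theta):
    """theta^(K-1) * prod (N_k - 1)! / ((theta + 1)(theta + 2)...(theta + N - 1))."""
    n, k = sum(block_sizes), len(block_sizes)
    numerator = Fraction(theta) ** (k - 1)
    for size in block_sizes:
        numerator *= factorial(size - 1)
    denominator = Fraction(1)
    for m in range(1, n):
        denominator *= theta + m
    return numerator / denominator


def sequential_probability(labels, theta):
    """Multiply the seating probabilities of the customers one at a time."""
    prob, counts = Fraction(1), Counter()
    for n, z in enumerate(labels, start=1):
        weight = counts[z] if counts[z] > 0 else theta
        prob *= Fraction(weight) / (n - 1 + theta)
        counts[z] += 1
    return prob


def restricted_growth_strings(n):
    """Yield every order-of-appearance labeling of [n], one per partition."""
    def extend(prefix, k):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for z in range(1, k + 2):
            yield from extend(prefix + [z], max(k, z))
    yield from extend([], 0)


theta = Fraction(3, 2)
total = Fraction(0)
for labels in restricted_growth_strings(4):
    sizes = list(Counter(labels).values())
    assert sequential_probability(labels, theta) == crp_eppf(sizes, theta)
    total += crp_eppf(sizes, theta)
print(total)  # 1
```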



3.2 Exchangeable feature probability function 

Just as we considered an exchangeable, consistent, random partition above,
so we now turn to an exchangeable, consistent, random feature allocation $(F_N)$.
Let $f_N = \{A_1, \ldots, A_K\}$ be any particular feature allocation. In calculating
$\mathbb{P}(F_N = f_N)$, we start by demonstrating in the next example that this
probability in some sense undercounts features when they contain exactly the same
indices: e.g., $A_j = A_k$ for some $j \neq k$.

Example 2 (A two-block, Bernoulli feature allocation). Let $q_A, q_B \in (0, 1)$
represent the frequencies of features $A$ and $B$. Draw
$Z_{A,n} \stackrel{iid}{\sim} \mathrm{Bern}(q_A)$ and
$Z_{B,n} \stackrel{iid}{\sim} \mathrm{Bern}(q_B)$, independently. Construct the
random feature allocation by collecting those indices with successful draws:

$$F_N := \{\{n : n \le N, Z_{A,n} = 1\}, \{n : n \le N, Z_{B,n} = 1\}\}.$$

Then the probability of the feature allocation $F_5 = f_5 := \{\{2, 3\}, \{2, 3\}\}$ is

$$q_A^2 (1 - q_A)^3 q_B^2 (1 - q_B)^3,$$

but the probability of the feature allocation $F_5 = f_5' := \{\{2, 3\}, \{2, 5\}\}$ is

$$2 q_A^2 (1 - q_A)^3 q_B^2 (1 - q_B)^3.$$

The difference is that in the latter case the features can be distinguished, and so
we must account for the two possible pairings of features to frequencies
$\{q_A, q_B\}$. Now, instead, let $\tilde{F}_N$ be $F_N$ with a uniform random
ordering on the features. There is just a single possible ordering of $f_5$, so the
probability of $\tilde{F}_5 = \tilde{f}_5 := (\{2, 3\}, \{2, 3\})$ is again

$$q_A^2 (1 - q_A)^3 q_B^2 (1 - q_B)^3.$$

However, there are two orderings of $f_5'$, so the probability of
$\tilde{F}_5 = \tilde{f}_5' := (\{2, 5\}, \{2, 3\})$ is

$$q_A^2 (1 - q_A)^3 q_B^2 (1 - q_B)^3,$$

and the same holds for the other ordering. ■
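The factor of two in this example can be confirmed by brute force. The Python sketch below enumerates all outcomes of the Bernoulli draws for $N = 5$ and accumulates the exact probability of each unordered feature allocation; the function name and the particular values $q_A = 1/3$, $q_B = 1/4$ are assumptions made only for illustration.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def bernoulli_allocation_pmf(n, q_a, q_b):
    """Exact pmf of the unordered two-block allocation, by brute-force enumeration."""
    pmf = defaultdict(Fraction)
    for z_a in product([0, 1], repeat=n):
        for z_b in product([0, 1], repeat=n):
            prob = Fraction(1)
            for draws, q in ((z_a, q_a), (z_b, q_b)):
                for bit in draws:
                    prob *= q if bit else 1 - q
            blocks = [tuple(i + 1 for i, bit in enumerate(draws) if bit)
                      for draws in (z_a, z_b)]
            # Sorting forgets which block came from A and which from B;
            # empty blocks are dropped (an assumption of this sketch).
            key = tuple(sorted(block for block in blocks if block))
            pmf[key] += prob
    return pmf


q_a, q_b = Fraction(1, 3), Fraction(1, 4)
pmf = bernoulli_allocation_pmf(5, q_a, q_b)
base = q_a**2 * (1 - q_a)**3 * q_b**2 * (1 - q_b)**3
print(pmf[((2, 3), (2, 3))] == base)      # True
print(pmf[((2, 3), (2, 5))] == 2 * base)  # True
```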

For reasons suggested by the previous example, we will find it useful to work
with the random feature allocation after uniform random ordering, $\tilde{F}_N$. One
way to achieve such an ordering and maintain consistency across different $N$ is
to associate some independent, continuous random variable with each feature;
e.g., assign a uniform random variable on $[0, 1]$ to each feature and order the
features according to the order of the assigned random variables. When we view
feature allocations constructed as marginals of a subordinator in Section 5, we
will see that this construction is natural.

In general, given a probability of a random feature allocation,
$\mathbb{P}(F_N = f_N)$, we can find the probability of a random ordered feature
allocation, $\mathbb{P}(\tilde{F}_N = \tilde{f}_N)$, as follows. Let $H$ be the
number of unique elements of $f_N$, and let $(K_1, \ldots, K_H)$ be the
multiplicities of these unique elements in decreasing size. Then

$$\mathbb{P}(\tilde{F}_N = \tilde{f}_N) =
\binom{K}{K_1, \ldots, K_H}^{-1} \mathbb{P}(F_N = f_N), \qquad (4)$$

where

$$\binom{K}{K_1, \ldots, K_H} := \frac{K!}{K_1! \cdots K_H!}.$$

We will see in Section 5 that augmentation of an exchangeable partition
with a random ordering is also natural. However, the probability of an ordered
random partition is not substantively different from the probability of an
unordered version since the factor contributed by ordering a partition is always
$1/K!$, where $K$ here is the number of partition blocks.
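The ordering factor is simple to compute from the multiplicities of the repeated blocks. The following Python sketch uses hypothetical helper names and reproduces the counts from the two-block Bernoulli example as well as the $K!$ count for a partition.

```python
from collections import Counter
from math import factorial


def ordering_count(blocks):
    """Number of distinct orderings of a multiset of blocks: K! / (K_1! ... K_H!)."""
    multiplicities = Counter(frozenset(block) for block in blocks)
    count = factorial(len(blocks))
    for multiplicity in multiplicities.values():
        count //= factorial(multiplicity)
    return count


def ordered_probability(unordered_probability, blocks):
    """Probability of one particular ordering under a uniform random ordering."""
    return unordered_probability / ordering_count(blocks)


print(ordering_count([{2, 3}, {2, 3}]))                    # 1
print(ordering_count([{2, 3}, {2, 5}]))                    # 2
print(ordering_count([{2, 3}, {2, 4, 6}, {3}, {3}, {3}]))  # 20
print(ordering_count([{1, 3, 4}, {2}, {5, 6}]))            # 6
```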

With this framework in place, we can see that some ordered feature allocations
have a probability function $p$, nearly as in the EPPF case, that is, moreover,
symmetric in its block-size arguments. Consider again the previous example.

Example 3 (A two-block, Bernoulli feature allocation (continued)). Consider
any $\tilde{F}_N$ with block sizes $N_1$ and $N_2$ constructed as in Example 2. Then

$$\begin{aligned}
\mathbb{P}(\tilde{F}_N = \tilde{f}_N)
&= \tfrac{1}{2} q_A^{N_1} (1 - q_A)^{N - N_1} q_B^{N_2} (1 - q_B)^{N - N_2}
 + \tfrac{1}{2} q_A^{N_2} (1 - q_A)^{N - N_2} q_B^{N_1} (1 - q_B)^{N - N_1} \\
&= p(N, N_1, N_2), \qquad (5)
\end{aligned}$$

where $p$ is some function of the number of indices $N$ and the block sizes
$(N_1, N_2)$ that we note is symmetric in all arguments after the first. In
particular, we see that the order of $N_1$ and $N_2$ was immaterial. ■

We note that in the partition case, $\sum_{k=1}^{K} |A_k| = N$, so $N$ is implicitly
an argument to the EPPF. In the feature case, this summation condition no longer
holds, so we make the argument $N$ explicit in Eq. (5).

However, it is not necessarily the case that such a function, much less a 
symmetric one, exists for exchangeable feature models — in contrast to the case 
of exchangeable partitions and the EPPF. 

Example 4 (A general two-block feature allocation). We here describe an ex- 
changeable, consistent random feature allocation whose (ordered) distribution 
does not depend only on the number of indices N and the sizes of the blocks of 
the feature allocation. 

Let $p_1, p_2, p_3, p_4$ be fixed frequencies that sum to one. Let $Y_n$ represent
the collection of features to which index $n$ belongs. For $n \in \{1, 2\}$, choose
$Y_n$ independently and identically according to:

$$Y_n = \begin{cases}
\{1\} & \text{with probability } p_1 \\
\{2\} & \text{with probability } p_2 \\
\{1, 2\} & \text{with probability } p_3 \\
\emptyset & \text{with probability } p_4.
\end{cases}$$

We form a feature allocation from these labels as follows. For each label (1 or
2), collect those indices $n$ with the given label appearing in $Y_n$ to form a
feature.

Now consider two possible outcome feature allocations: $f_2 = \{\{2\}, \{2\}\}$ and
$f_2' = \{\{1\}, \{2\}\}$. The likelihood of any ordering $\tilde{f}_2$ of $f_2$
under this model is

$$\mathbb{P}(\tilde{F}_2 = \tilde{f}_2) = p_1^0 \, p_2^0 \, p_3^1 \, p_4^1.$$

The likelihood of any ordering $\tilde{f}_2'$ of $f_2'$ is

$$\mathbb{P}(\tilde{F}_2 = \tilde{f}_2') = p_1^1 \, p_2^1 \, p_3^0 \, p_4^0.$$

It follows from these two likelihoods that we can choose values of
$p_1, p_2, p_3, p_4$ such that
$\mathbb{P}(\tilde{F}_2 = \tilde{f}_2) \neq \mathbb{P}(\tilde{F}_2 = \tilde{f}_2')$.
But $f_2$ and $f_2'$ have the same block counts and $N$ value ($N = 2$). So there
can be no such symmetric function $p$, as in Eq. (5), for this model. ■
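The two likelihoods in this example can be checked by enumerating the sixteen joint outcomes of $(Y_1, Y_2)$. In the Python sketch below, the function name and the frequencies $(1/10, 2/10, 3/10, 4/10)$ are illustrative assumptions, and empty features are dropped when forming the allocation.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from itertools import product
from math import factorial


def ordered_pmf(p1, p2, p3, p4):
    """Probability of each single ordering of every feature allocation of [2]."""
    choices = [(frozenset({1}), p1), (frozenset({2}), p2),
               (frozenset({1, 2}), p3), (frozenset(), p4)]
    unordered = defaultdict(Fraction)
    for (y1, w1), (y2, w2) in product(choices, repeat=2):
        # Feature `label` collects the indices n whose label set Y_n contains it.
        blocks = [tuple(n for n, y in ((1, y1), (2, y2)) if label in y)
                  for label in (1, 2)]
        key = tuple(sorted(block for block in blocks if block))
        unordered[key] += w1 * w2
    ordered = {}
    for key, prob in unordered.items():
        orderings = factorial(len(key))
        for multiplicity in Counter(key).values():
            orderings //= factorial(multiplicity)
        ordered[key] = prob / orderings
    return ordered


p1, p2, p3, p4 = (Fraction(k, 10) for k in (1, 2, 3, 4))
ordered = ordered_pmf(p1, p2, p3, p4)
print(ordered[((2,), (2,))] == p3 * p4)  # True
print(ordered[((1,), (2,))] == p1 * p2)  # True
```

With these frequencies the two values are $3/25$ and $1/50$, even though both allocations consist of two blocks of size one with $N = 2$.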

When a function p exists in the form 

$$\mathbb{P}(\tilde{F}_N = \tilde{f}_N) = p(N, |A_1|, \ldots, |A_K|) \qquad (6)$$

for some ordered feature allocation $\tilde{f}_N = (A_1, \ldots, A_K)$ such that $p$ is symmetric
in all arguments after the first, we call it the exchangeable feature probability 
function (EFPF). We take care to note that the EPPF is not a special case of 
the EFPF. Indeed, the EPPF assigns zero probability to any multiset in which 
an index occurs in more than one element of the multiset whereas the indices in 
the multiset provide no information to the EFPF; only the sizes of the multiset 
blocks are relevant in the EFPF case. 

We next consider a more complex example of an EFPF.

Figure 2: Illustration of an Indian buffet process. The buffet (top) consists of a
vector of dishes, corresponding to features. Each customer (corresponding to a
data point) who enters first decides whether or not to eat dishes that the other
customers have already sampled and then tries a random number of new dishes,
not previously sampled by any customer. A gray box in position $(n, k)$ indicates
customer $n$ has sampled dish $k$, and a white box indicates the customer has not
sampled the dish. In the example, the second customer has sampled exactly
those dishes indexed by 2, 4, and 5: $Y_2 = \{2, 4, 5\}$.



Example 5 (Indian buffet process). The Indian buffet process (IBP)
[Griffiths and Ghahramani, 2006] is a generative model for a random feature
allocation that is specified recursively like the Chinese restaurant process. Also
like the CRP, this culinary metaphor forms an equivalence between customers and the
indices $n$ that will be partitioned: $n \in \mathbb{N}$. Here, "dishes" again
correspond to feature labels just as they corresponded to partition labels for the
CRP. But in the IBP case, a customer can sample multiple dishes.

In particular, we start with a single customer, who enters the buffet and
chooses $K_1^+ \sim \mathrm{Pois}(\gamma)$ dishes. Here, $\gamma > 0$ is called the
mass parameter, and we will also see the concentration parameter $\theta > 0$ below.
None of the dishes have been sampled by any other customers since no other customers
have yet entered the restaurant. We label the dishes $1, \ldots, K_1^+$ if
$K_1^+ > 0$. Recursively, the $n$th customer chooses which dishes to sample in two
parts. First, for each dish $k$ that has previously been sampled by any customer in
$1, \ldots, n - 1$, customer $n$ samples dish $k$ with probability
$N_{n-1,k} / (\theta + n - 1)$, for $N_{n,k}$ equal to the number of customers
indexed $1, \ldots, n$ who have tried dish $k$. As each dish represents a
feature, and sampling a dish represents that the customer index $n$ belongs to
that feature, $N_{n,k}$ is the size of the block of the feature labeled $k$ in the
feature allocation of $[n]$. Next, customer $n$ chooses
$K_n^+ \sim \mathrm{Pois}(\theta\gamma / (\theta + n - 1))$ new dishes to try. If
$K_n^+ > 0$, then the dishes receive unique labels $K_{n-1} + 1, \ldots, K_n$.
Here, $K_n$ represents the number of sampled dishes after $n$ customers:
$K_n = K_{n-1} + K_n^+$ (with $K_0 = 0$).
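The buffet rule can be simulated in a few lines. The Python sketch below is a minimal illustration using only the standard library; the function names are hypothetical, and the Poisson draw uses Knuth's multiplication method, a choice of ours that is adequate only for the small rates arising here.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method for a Poisson draw (fine for small rates)."""
    threshold = math.exp(-rate)
    count, running_product = 0, rng.random()
    while running_product > threshold:
        count += 1
        running_product *= rng.random()
    return count


def sample_ibp(n, gamma, theta, rng):
    """Return the dish sets Y_1, ..., Y_n, labeling dishes in order of appearance."""
    dish_counts = []  # dish_counts[k - 1] = customers so far who tried dish k
    samples = []
    for i in range(1, n + 1):
        # Previously sampled dishes: dish k is taken w.p. N_{i-1,k} / (theta + i - 1).
        chosen = {k for k, count in enumerate(dish_counts, start=1)
                  if rng.random() < count / (theta + i - 1)}
        for k in chosen:
            dish_counts[k - 1] += 1
        # New dishes: a Poisson number with rate theta * gamma / (theta + i - 1).
        for _ in range(sample_poisson(theta * gamma / (theta + i - 1), rng)):
            dish_counts.append(1)
            chosen.add(len(dish_counts))
        samples.append(chosen)
    return samples


samples = sample_ibp(10, gamma=2.0, theta=1.0, rng=random.Random(0))
for customer, dishes in enumerate(samples, start=1):
    print(customer, sorted(dishes))
```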

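With this generative model in hand, we can find the probability of a
particular feature allocation. We discover its form by enumeration as for the CRP
EPPF in Example 1. At each round $n$, we have a Poisson number of new features,
$K_n^+$, represented. The probability factor associated with these choices is
a product of Poisson densities:

$$\prod_{n=1}^{N} \frac{1}{K_n^+!}
\left( \frac{\theta\gamma}{\theta + n - 1} \right)^{K_n^+}
\exp\left( -\frac{\theta\gamma}{\theta + n - 1} \right).$$

Let $M_k$ be the round on which the $k$th dish, in order of appearance, is first
chosen. Then the denominators for future dish choice probabilities are the
factors in the product $(\theta + M_k)(\theta + M_k + 1) \cdots (\theta + N - 1)$.
The numerators for the times when the dish is chosen are the factors in the product
$1 \cdot 2 \cdots (N_{N,k} - 1)$. The numerators for the times when the dish is not
chosen yield $(\theta + M_k - 1) \cdots (\theta + N - 1 - N_{N,k})$. Let $A_{n,k}$
represent the collection of indices in the feature with label $k$ after $n$
customers have entered the restaurant. Then $N_{n,k} = |A_{n,k}|$. Finally, let
$K_1, \ldots, K_H$ be the multiplicities of unique feature blocks formed by this
model (indexed by $h$ as in Eq. (4), and distinct from the dish counts $K_n$). We
note that there are

$$\left[ \prod_{n=1}^{N} K_n^+! \right] \Big/ \left[ \prod_{h=1}^{H} K_h! \right]$$

rearrangements of the features generated by this process that all yield the same
feature allocation. Since they all have the same generating probability, we
simply multiply by this factor to find the feature allocation probability.
Multiplying all factors together and taking
$f_N = \{A_{N,1}, \ldots, A_{N,K_N}\}$ yields

$$\begin{aligned}
\mathbb{P}(F_N = f_N)
&= \frac{\prod_{n=1}^{N} K_n^+!}{\prod_{h=1}^{H} K_h!}
\left[ \prod_{n=1}^{N} \frac{1}{K_n^+!}
\left( \frac{\theta\gamma}{\theta + n - 1} \right)^{K_n^+}
\exp\left( -\frac{\theta\gamma}{\theta + n - 1} \right) \right]
\left[ \prod_{k=1}^{K_N} \frac{\Gamma(\theta + M_k)}{\Gamma(\theta + N)}
\cdot \frac{\Gamma(N_{N,k}) \, \Gamma(\theta + N - N_{N,k})}{\Gamma(\theta + M_k - 1)} \right] \\
&= \left( \prod_{h=1}^{H} K_h! \right)^{-1}
\left[ \prod_{n=1}^{N} \left( \frac{\theta\gamma}{\theta + n - 1} \right)^{K_n^+}
\exp\left( -\frac{\theta\gamma}{\theta + n - 1} \right) \right]
\left[ \prod_{n=1}^{N} (\theta + n - 1)^{K_n^+} \right]
\prod_{k=1}^{K_N} \frac{\Gamma(N_{N,k}) \, \Gamma(\theta + N - N_{N,k})}{\Gamma(\theta + N)} \\
&= \left( \prod_{h=1}^{H} K_h! \right)^{-1} (\theta\gamma)^{K_N}
\exp\left( -\theta\gamma \sum_{n=1}^{N} (\theta + n - 1)^{-1} \right)
\prod_{k=1}^{K_N} \frac{\Gamma(N_{N,k}) \, \Gamma(N - N_{N,k} + \theta)}{\Gamma(N + \theta)}.
\end{aligned}$$

It follows from Eq. (4) that the probability of a uniform random ordering of
the feature allocation is

$$\mathbb{P}(\tilde{F}_N = \tilde{f}_N) = \frac{1}{K_N!} (\theta\gamma)^{K_N}
\exp\left( -\theta\gamma \sum_{n=1}^{N} (\theta + n - 1)^{-1} \right)
\prod_{k=1}^{K_N} \frac{\Gamma(N_{N,k}) \, \Gamma(N - N_{N,k} + \theta)}{\Gamma(N + \theta)}. \qquad (7)$$

The distribution of $\tilde{F}_N$ has no dependence on the ordering of the indices
in $[N]$. Hence, the distribution of $F_N$ depends only on the same quantities
(the number of indices and the feature block sizes) and the feature multiplicities.
So we see that the IBP construction yields an exchangeable random feature
allocation. Consistency follows from the recursive construction and exchangeability.
Therefore, Eq. (7) is seen to be in EFPF form (cf. Eq. (6)). ■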
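The closed form for the ordered feature allocation can be checked numerically against a step-by-step evaluation of the buffet rule. In the Python sketch below, the function names and the small test configuration are assumptions of ours; the sequential version multiplies each Poisson term by $K_n^+!$ (the relabeling factor) and divides by $K_N!$ at the end (the uniform ordering), as in the derivation above.

```python
import math


def ibp_efpf(n, block_sizes, gamma, theta):
    """Closed-form probability of one ordering, computed on the log scale."""
    k = len(block_sizes)
    harmonic = sum(1.0 / (theta + i - 1) for i in range(1, n + 1))
    log_p = -math.lgamma(k + 1) + k * math.log(theta * gamma) - theta * gamma * harmonic
    for size in block_sizes:
        log_p += (math.lgamma(size) + math.lgamma(n - size + theta)
                  - math.lgamma(n + theta))
    return math.exp(log_p)


def ibp_sequential_ordered(samples, gamma, theta):
    """The same probability, accumulated customer by customer from the buffet rule."""
    counts = []  # counts[k - 1] = customers so far who tried dish k
    prob = 1.0
    for i, chosen in enumerate(samples, start=1):
        for k, count in enumerate(counts, start=1):
            p = count / (theta + i - 1)
            prob *= p if k in chosen else 1.0 - p
        new = sum(1 for k in chosen if k > len(counts))
        rate = theta * gamma / (theta + i - 1)
        prob *= math.exp(-rate) * rate**new  # Poisson pmf times new! relabelings
        for k in chosen:
            if k <= len(counts):
                counts[k - 1] += 1
        counts.extend([1] * new)
    return prob / math.factorial(len(counts))


# Three customers, three dishes, with block sizes (2, 2, 1).
samples = [{1, 2}, {1, 3}, {2}]
closed_form = ibp_efpf(3, [2, 2, 1], gamma=2.0, theta=1.5)
sequential = ibp_sequential_ordered(samples, gamma=2.0, theta=1.5)
print(math.isclose(closed_form, sequential, rel_tol=1e-9))  # True
```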

Above, we have seen two examples of how specifying a conditional distribution
for the block membership of index $n$ given the block membership of indices
in $[n - 1]$ (also called a prediction rule [Hansen and Pitman, 1998]) yields an
exchangeable probability function: e.g., the EPPF in the CRP case (Example 1)
and the EFPF in the IBP case (Example 5). We will see next that the prediction
rule can conversely be recovered from the exchangeable probability function
specification and therefore the two are equivalent.



3.3 Induced allocations and block labeling 

In Examples 1 and 5 above, we formed partitions and feature allocations in the
following way. For partitions, we assigned labels $Z_n$ to each index $n$. Then we
generated a partition of $[N]$ from the sequence $(Z_n)_{n=1}^{N}$ by saying that
indices $m$ and $n$ are in the same partition block ($m \sim n$) if and only if
$Z_n = Z_m$. The resulting partition is called the induced partition given the
labels $(Z_n)$. Similarly, given labels $(Z_n)_{n=1}^{\infty}$, we can form an
induced partition of $\mathbb{N}$. It is easy to check that, given a sequence
$(Z_n)_{n=1}^{\infty}$, the induced partitions of the subsequences
$(Z_n)_{n=1}^{N}$ will be consistent.

In the feature case, we first assigned label collections $Y_n$ to each index $n$;
$Y_n$ is interpreted as a set containing the labels of the features to which $n$
belongs, and we assume it has finite cardinality. In this case, we generate a
feature allocation on $[N]$ from the sequence $(Y_n)_{n=1}^{N}$ by first letting
$\{\phi_k\}_{k=1}^{K}$ be the set of unique values in $\bigcup_{n=1}^{N} Y_n$. Then
the features are the collections of indices with shared labels:
$f_N = \{\{n : \phi_k \in Y_n\} : k = 1, \ldots, K\}$. The resulting feature
allocation $f_N$ is called the induced feature allocation given the labels $(Y_n)$.
Similarly, given label collections $(Y_n)_{n=1}^{\infty}$, where each $Y_n$ has
finite cardinality, we can form an induced feature allocation of $\mathbb{N}$. As in
the partition case, given a sequence $(Y_n)_{n=1}^{\infty}$, we can see that the
induced feature allocations of the subsequences $(Y_n)_{n=1}^{N}$ will be consistent.
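Both induced constructions amount to grouping indices by shared labels, as the following Python sketch illustrates. The function names and the string labels used for the features are hypothetical.

```python
def induced_partition(labels):
    """Blocks of indices n (1-based) that share the same label Z_n."""
    blocks = {}
    for n, z in enumerate(labels, start=1):
        blocks.setdefault(z, set()).add(n)
    return list(blocks.values())


def induced_feature_allocation(label_sets):
    """One block per unique label phi, containing every n with phi in Y_n."""
    blocks = {}
    for n, y in enumerate(label_sets, start=1):
        for phi in y:
            blocks.setdefault(phi, set()).add(n)
    return list(blocks.values())


z = [1, 2, 1, 1, 3, 3]
y = [set(), {"a", "b"}, {"a", "c", "d", "e"}, {"b"}, set(), {"b"}]

print(sorted(sorted(b) for b in induced_partition(z)))
# [[1, 3, 4], [2], [5, 6]]
print(sorted(sorted(b) for b in induced_feature_allocation(y)))
# [[2, 3], [2, 4, 6], [3], [3], [3]]
```

Applying either function to a prefix of the label sequence gives the restriction of the full result to the corresponding indices, which is the consistency property noted above.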

In reducing to a partition or feature allocation from a set of labels, we 
shed the information concerning the labels for each partition block or feature. 
Conversely, we introduce order of appearance labeling schemes to give partition 
blocks or features labels given a partition or feature allocation. 

In the partition case, the order of appearance labeling scheme assigns the
label 1 to the partition block containing index 1. Recursively, suppose we have
seen $n$ indices in $k$ different blocks with labels $\{1, \ldots, k\}$, and suppose
the $(n + 1)$st index does not belong to an existing block. Then we assign its block
the label $k + 1$.

In the feature allocation case, we note that index 1 belongs to $K_1^+$ features.
If $K_1^+ = 0$, there are no features to label yet. If $K_1^+ > 0$, we assign these
$K_1^+$ features labels in $\{1, \ldots, K_1^+\}$ and note that there are $K_1^+!$ ways
of doing so. Unless otherwise specified, we suppose that the labels are chosen
uniformly at random. Let $K_1 = K_1^+$. Recursively, suppose we have seen $n$ indices
and $K_n$ different features with labels $\{1, \ldots, K_n\}$. Suppose the $(n+1)$st
index belongs to $K_{n+1}^+$ features that have not yet been labeled. If
$K_{n+1}^+ = 0$, there are no new features to label and $K_{n+1} = K_n$. If
$K_{n+1}^+ > 0$, we let $K_{n+1} = K_n + K_{n+1}^+$ and assign these $K_{n+1}^+$
features labels in $\{K_n + 1, \ldots, K_{n+1}\}$, e.g. uniformly at random.
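As a rough sketch of the two order of appearance schemes, the following Python relabels arbitrary labels by first appearance. The function names are our own, labels are assumed to be hashable and sortable, and ties among features first seen at the same index are broken uniformly at random as suggested above.

```python
import random


def order_of_appearance_partition(labels):
    """Relabel block labels as 1, 2, ... in order of first appearance."""
    seen = {}
    out = []
    for z in labels:
        if z not in seen:
            seen[z] = len(seen) + 1
        out.append(seen[z])
    return out


def order_of_appearance_features(label_sets, rng=random):
    """Relabel feature labels; features new at the same index get the next
    block of integers in a uniformly random order."""
    seen = {}
    out = []
    for y in label_sets:
        new = sorted(phi for phi in y if phi not in seen)
        rng.shuffle(new)  # one of the K! equally likely orderings
        for phi in new:
            seen[phi] = len(seen) + 1
        out.append({seen[phi] for phi in y})
    return out
```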

We can use these labeling schemes to find the prediction rule, which makes 
use of partition block and feature labels, from the EPPF or EFPF. First, 
consider a partition with EPPF $p$. Then, given labels $(Z_n)_{n=1}^N$ with
$K_N = \max\{Z_1, \ldots, Z_N\}$, we wish to find the distribution of the label
$Z_{N+1}$. Using an order of appearance labeling, we know that either
$Z_{N+1} \in \{Z_1, \ldots, Z_N\}$ or $Z_{N+1} = K_N + 1$. Let
$\pi_N = \{A_{N,1}, \ldots, A_{N,K_N}\}$ be the partition induced by $(Z_n)_{n=1}^N$.
Let $N_{N,k} = |A_{N,k}|$. So $N_{N+1,k} = N_{N,k} + 1\{Z_{N+1} = k\}$ for
$k = 1, \ldots, K_{N+1}$, and $K_{N+1} = K_N + 1\{Z_{N+1} > K_N\}$ is the number of
partition blocks in the partition of $[N+1]$; here, we let $N_{N,K_N+1} = 0$. Then
the conditional distribution satisfies

$$P(Z_{N+1} = z \mid Z_1, \ldots, Z_N) = \frac{P(Z_1, \ldots, Z_N, Z_{N+1} = z)}{P(Z_1, \ldots, Z_N)}.$$

But the probability of a certain labeling is just the probability of the underlying
partition in this construction, so

$$P(Z_{N+1} = z \mid Z_1, \ldots, Z_N) = \frac{p(N_{N+1,1}, \ldots, N_{N+1,K_{N+1}})}{p(N_{N,1}, \ldots, N_{N,K_N})}.$$



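Example 6 (Chinese restaurant process). We continue our Chinese restaurant
process example by deriving the Chinese restaurant table assignment scheme
from the EPPF. Substituting in the EPPF for the CRP, we find

$$\begin{aligned}
P(Z_{N+1} = z \mid Z_1, \ldots, Z_N)
&= \frac{p(N_{N+1,1}, \ldots, N_{N+1,K_{N+1}})}{p(N_{N,1}, \ldots, N_{N,K_N})} \\
&= \frac{\theta^{K_{N+1}-1} \left( \prod_{k=1}^{K_{N+1}} (N_{N+1,k} - 1)! \right) / (\theta+1)_{N \uparrow 1}}
        {\theta^{K_N-1} \left( \prod_{k=1}^{K_N} (N_{N,k} - 1)! \right) / (\theta+1)_{N-1 \uparrow 1}} \\
&= \begin{cases}
     N_{N,k} / (N + \theta) & \text{for } z = k \le K_N \\
     \theta / (N + \theta) & \text{for } z = K_N + 1,
   \end{cases}
\end{aligned}$$

where $(\theta+1)_{M \uparrow 1} = (\theta+1)(\theta+2)\cdots(\theta+M)$ is the rising
factorial, just as in Eq. (2).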
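The ratio in Example 6 can be checked numerically. The sketch below assumes the CRP EPPF in the form $\theta^{K-1} \prod_k (N_k - 1)! / (\theta+1)_{N-1 \uparrow 1}$; the function names are hypothetical, and the code is meant only as a sanity check of the algebra, not as an efficient sampler.

```python
from math import factorial, isclose


def crp_eppf(sizes, theta):
    """CRP EPPF evaluated at block sizes (N_1, ..., N_K)."""
    n = sum(sizes)
    rising = 1.0
    for i in range(1, n):          # (theta+1)(theta+2)...(theta+N-1)
        rising *= theta + i
    value = theta ** (len(sizes) - 1)
    for size in sizes:
        value *= factorial(size - 1)
    return value / rising


def crp_prediction_rule(sizes, theta):
    """Probabilities that index N+1 joins block 1..K, then a new block."""
    base = crp_eppf(sizes, theta)
    probs = []
    for k in range(len(sizes)):
        grown = list(sizes)
        grown[k] += 1
        probs.append(crp_eppf(grown, theta) / base)
    probs.append(crp_eppf(list(sizes) + [1], theta) / base)
    return probs
```

For block sizes (3, 1, 2) the returned probabilities are proportional to (3, 1, 2, θ) with normalizer N + θ = 6 + θ, matching the table assignment scheme.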
To find the feature allocation prediction rule, we now imagine a feature
allocation with EFPF $p$. Here we must be slightly more careful about counting
due to feature multiplicities. Suppose that after $N$ indices have been seen,
we have label collections $(Y_n)_{n=1}^N$, containing a total of $K_N$ features,
labeled $\{1, \ldots, K_N\}$. We wish to find the distribution of $Y_{N+1}$. Suppose
$N+1$ belongs to $K_{N+1}^+$ features that do not contain any index in $[N]$. Using
an order of appearance labeling, we know that, if $K_{N+1}^+ > 0$, the $K_{N+1}^+$
new features have labels $K_N + 1, \ldots, K_N + K_{N+1}^+$. Let
$f_N = \{A_{N,1}, \ldots, A_{N,K_N}\}$ be the feature allocation induced by
$(Y_n)_{n=1}^N$. Let $N_{N,k} = |A_{N,k}|$ be the size of the $k$th feature. So
$N_{N+1,k} = N_{N,k} + 1\{k \in Y_{N+1}\}$, where we let $N_{N,K_N+j} = 0$ for all of
the features that are first exhibited by index $N+1$: $j \in \{1, \ldots, K_{N+1}^+\}$.
Further, let the number of features, including new ones, be written
$K_{N+1} = K_N + K_{N+1}^+$. Then the conditional distribution satisfies

$$P(Y_{N+1} = z \mid Y_1, \ldots, Y_N) = \frac{P(Y_1, \ldots, Y_N, Y_{N+1} = z)}{P(Y_1, \ldots, Y_N)}.$$

As we assume that the labels $Y$ are consistent across $N$, the probability of a
certain labeling is just the probability of the underlying feature allocation and
a combinatorial term accounting for the possible orderings of the new features:

$$P(Y_{N+1} = z \mid Y_1, \ldots, Y_N) = \frac{1}{K_{N+1}^+!} \cdot \frac{p(N+1, N_{N+1,1}, \ldots, N_{N+1,K_{N+1}})}{p(N, N_{N,1}, \ldots, N_{N,K_N})}. \tag{9}$$



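Example 7 (Indian buffet process). Just as we derived the Chinese restaurant
process prediction rule (Eq. (2)) from its EPPF in Example 6, so can we derive
the Indian buffet process prediction rule from its EFPF by using Eq. (9).
Substituting the IBP EFPF into Eq. (9), we find

$$\begin{aligned}
&P(Y_{N+1} = z \mid Y_1, \ldots, Y_N) \\
&= \frac{1}{K_{N+1}^+!} \cdot
   \frac{(\gamma\theta)^{K_{N+1}} \exp\left( -\gamma\theta \sum_{n=1}^{N+1} (\theta + n - 1)^{-1} \right)
         \prod_{k=1}^{K_{N+1}} \frac{\Gamma(N_{N+1,k}) \Gamma((N+1) - N_{N+1,k} + \theta)}{\Gamma((N+1) + \theta)}}
        {(\gamma\theta)^{K_N} \exp\left( -\gamma\theta \sum_{n=1}^{N} (\theta + n - 1)^{-1} \right)
         \prod_{k=1}^{K_N} \frac{\Gamma(N_{N,k}) \Gamma(N - N_{N,k} + \theta)}{\Gamma(N + \theta)}} \\
&= \frac{1}{K_{N+1}^+!} (\gamma\theta)^{K_{N+1}^+}
   \exp\left( -\frac{\gamma\theta}{\theta + (N+1) - 1} \right)
   \prod_{k=K_N+1}^{K_{N+1}} \left( \theta + (N+1) - 1 \right)^{-1} \\
&\qquad \cdot \prod_{k=1}^{K_N}
   \left( \frac{N_{N,k}}{N + \theta} \right)^{1\{k \in z\}}
   \left( 1 - \frac{N_{N,k}}{N + \theta} \right)^{1\{k \notin z\}} \\
&= \mathrm{Pois}\left( K_{N+1}^+ \,\Big|\, \frac{\gamma\theta}{\theta + (N+1) - 1} \right)
   \prod_{k=1}^{K_N} \mathrm{Bern}\left( 1\{k \in z\} \,\Big|\, \frac{N_{N,k}}{N + \theta} \right).
\end{aligned}$$

The final line is exactly the Poisson likelihood for the number of new features
times the Bernoulli likelihoods for the draws of existing features, as described
in Example 5. ■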
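The Poisson and Bernoulli factors of the IBP prediction rule translate directly into a sequential sampler. The following is a minimal sketch with hypothetical function names; it uses Knuth's simple multiplication method for Poisson draws (adequate for small rates) so that only the standard library is needed, and it labels new features with the next unused integers.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; suitable for small rates only."""
    limit = math.exp(-rate)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def ibp_next(feature_sizes, n_seen, gamma, theta, rng):
    """Draw Y_{N+1} given the sizes N_{N,k} of the K_N existing features."""
    denom = n_seen + theta
    chosen = {k for k, size in enumerate(feature_sizes, start=1)
              if rng.random() < size / denom}          # Bern(N_{N,k}/(N+theta))
    k_new = sample_poisson(gamma * theta / denom, rng)  # Pois(gamma*theta/(N+theta))
    k_old = len(feature_sizes)
    chosen |= set(range(k_old + 1, k_old + k_new + 1))
    return chosen


def simulate_ibp(n, gamma, theta, seed=0):
    """Label collections Y_1..Y_n and the final feature sizes."""
    rng = random.Random(seed)
    sizes, ys = [], []
    for n_seen in range(n):
        y = ibp_next(sizes, n_seen, gamma, theta, rng)
        sizes.extend([0] * (max(y, default=0) - len(sizes)))
        for k in y:
            sizes[k - 1] += 1
        ys.append(y)
    return ys, sizes
```

With no indices seen yet, the Poisson rate reduces to γ, as in the first round of the buffet.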



3.4 Inference 

The prediction rule formulation of the EPPF or EFPF is particularly useful in 
providing a means of inferring partitions and feature allocations from a data 
set. In particular, we assume that we have data points $X_1, \ldots, X_N$ generated
in the following manner. In the partition case, we generate an exchangeable,
consistent, random partition $\Pi_N$ according to the distribution specified by some
EPPF $p$. Next, we assign each partition block a random parameter that
characterizes that block. To be precise, for the $k$th partition block to appear
according to an order of appearance labeling scheme, give this block a new random
label $\phi_k \stackrel{iid}{\sim} H$, for some continuous distribution $H$. For each
$n$, let $Z_n = \phi_k$ where $k$ is the order of appearance label of index $n$.
Finally, let

$$X_n \stackrel{indep}{\sim} F(Z_n), \tag{10}$$

for some likelihood $F$. The choices of both $H$ and $F$ are specific to the problem
domain.

Note that the sequence $(Z_n)_{n=1}^N$ is sufficient to describe the partition $\Pi_N$
since $\Pi_N$ is the collection of blocks of $[N]$ with the same label values $Z_n$. The
continuity of $H$ is necessary to guarantee the a.s. uniqueness of the block values.
So, if we can describe the posterior distribution of $(Z_n)_{n=1}^N$, we can in
principle describe the posterior distribution of $\Pi_N$.

The posterior distribution of $(Z_n)_{n=1}^N$ conditional on $(X_n)_{n=1}^N$ cannot
typically be solved for in closed form, so we turn to a method that approximates
this posterior. We will see that prediction rules facilitate the design of a Markov
Chain Monte Carlo (MCMC) sampler, in which we approximate the desired posterior
distribution by a Markov chain of random samples proven to have the true
posterior as its equilibrium distribution.

In the Gibbs sampler formulation of MCMC [Geman and Geman, 1984],
we sample each parameter in turn and conditional on all other parameters in
the model. In our case, we will sequentially sample each element of $(Z_n)_{n=1}^N$.
The key observation here is that $(Z_n)_{n=1}^N$ is an exchangeable sequence. This
observation follows by noting that the partition is exchangeable by assumption,
and the sequence $(\phi_k)$ is exchangeable since it is iid; $(Z_n)$ is an exchangeable
sequence since it is a function of $(\Pi_n)$ and $(\phi_k)$. Therefore, the distribution
of $Z_n$ given the remaining elements
$Z_{-n} := (Z_1, \ldots, Z_{n-1}, Z_{n+1}, \ldots, Z_N)$ is
the same as if we thought of $Z_n$ as the final, $N$th element in a sequence with
$N - 1$ preceding values given by $Z_{-n}$. And the distribution of $Z_N$ given $Z_{-N}$
is provided by the prediction rule. The full details of the Gibbs sampler for
the CRP in Examples 1 and 6 were introduced by Escobar [1994], MacEachern
[1994], and Escobar and West [1995] and are covered in fuller generality by Neal
[2000].
It is worth noting that the sequence of order of appearance labels is not
exchangeable; for instance, the first label is always 1. However, the prediction
rule for $Z_N$ given $(Z_1, \ldots, Z_{N-1})$ breaks into two parts: (1) the probability
of $Z_N$ taking a value either in $\{Z_1, \ldots, Z_{N-1}\}$ or a new value and (2) the
distribution of $Z_N$ when it takes a new value. When programming such a
sampler, it is often useful to simply encode the sets of unique values, which
may be done by retaining any set of labels that induce the correct partition
(e.g. integer labels) and separately the set of unique parameter values. Indeed,
updating the parameter values and partition block belonging separately can lead
to improved mixing of the sampler [MacEachern, 1994].
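To make the Gibbs update concrete, here is a small sketch for one particular, assumed model: $F(\phi)$ is a normal distribution with mean $\phi$ and known variance `sigma2`, and $H$ is normal with mean 0 and variance `tau2`. As a simplification, the sketch integrates the block parameters out analytically instead of storing them, so it is a collapsed variant of the sampler described above rather than the scheme of the cited papers; the partition is encoded by arbitrary integer labels.

```python
import math
import random


def normal_logpdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def gibbs_sweep(x, z, theta, sigma2, tau2, rng):
    """Resample each integer label z[n] given all other labels (in place)."""
    for n in range(len(x)):
        # By exchangeability, treat index n as the last arrival.
        counts, sums = {}, {}
        for m in range(len(x)):
            if m != n:
                counts[z[m]] = counts.get(z[m], 0) + 1
                sums[z[m]] = sums.get(z[m], 0.0) + x[m]
        options, logw = [], []
        for k, c in counts.items():
            # CRP weight N_k times the posterior predictive of x[n] in block k.
            prec = 1.0 / tau2 + c / sigma2
            mean = (sums[k] / sigma2) / prec
            options.append(k)
            logw.append(math.log(c) + normal_logpdf(x[n], mean, sigma2 + 1.0 / prec))
        # New block: CRP weight theta times the prior predictive of x[n].
        options.append(max(counts, default=0) + 1)
        logw.append(math.log(theta) + normal_logpdf(x[n], 0.0, sigma2 + tau2))
        top = max(logw)
        weights = [math.exp(lw - top) for lw in logw]
        z[n] = rng.choices(options, weights=weights)[0]
    return z
```

The common normalizer $N - 1 + \theta$ of the prediction rule is omitted because the weights are renormalized before sampling.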



Similarly, in the feature case, we imagine the following generative model for
our data. First, let $F_N$ be a random feature allocation generated according to the
EFPF $p$. For the $k$th feature block in an order of appearance labeling scheme,
assign a random label $\phi_k \stackrel{iid}{\sim} H$ to this block for some continuous
distribution $H$. For each $n$, let $Y_n = \{\phi_k : k \in J_n\}$, where $J_n$ is here the
set of order of appearance labels of the features to which $n$ belongs. Finally, as
above,

$$X_n \stackrel{indep}{\sim} F(Y_n),$$

where the data likelihood $F$ and parameter distribution $H$ are again
application-specific and where now $F$ depends on the variable-size collection of
parameters in $Y_n$.

Again, we observe that although the order of appearance label sets are not
exchangeable, the sequence $(Y_n)$ is. This fact allows the formulation of a Gibbs
sampler via the observation that the distribution of $Y_n$ given the remaining
elements $Y_{-n} := (Y_1, \ldots, Y_{n-1}, Y_{n+1}, \ldots, Y_N)$ is the same as if we
thought of $Y_n$ as the final, $N$th element in a sequence with $N - 1$ preceding
values given by $Y_{-n}$. The full details of such a sampler in the IBP case
(Examples 5 and 7) are given by Griffiths and Ghahramani



ipierm i 
J [2003. 



As in the partition case, in practice when programming the sampler, it is
useful to separate the feature allocation encoding from the feature parameter
values. Griffiths and Ghahramani [2006] describe how left order form matrices
give a convenient representation of the feature allocation in this context.



4 Stick lengths 

Not every symmetric function defined for an arbitrary number of arguments with
values in the unit interval is an EPPF [Pitman, 1995], and not every symmetric
function with an additional positive integer argument is an EFPF. For instance,
the consistency property in Eq. (1) implies certain additivity requirements for
the function $p$.

Example 8 (Not an EPPF). Consider the function $p$ defined with

$$p(1) = 1, \quad p(1,1) = 0.1, \quad p(2) = 0.8, \quad \ldots$$

From the information above, $p$ may be further defined so as to be symmetric
in its arguments for any number of arguments, but since it does not satisfy
$p(1) = p(1,1) + p(2)$, it cannot be an EPPF. ■

Example 9 (Not an EFPF). Consider the function $p$ defined with

$$p(N=1) = 0.9, \quad p(N=1, 1) = 0.9, \quad p(N=1, 1, 1) = 0.9, \quad \ldots$$

From the information above, $p$ may be further defined so as to be symmetric in
its arguments for any number of arguments after the initial $N$ argument, but
since $(0!)^{-1} p(N=1) + (1!)^{-1} p(N=1, 1) + (2!)^{-1} p(N=1, 1, 1) > 1$, it cannot
be an EFPF. ■
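The arithmetic behind Examples 8 and 9 is easy to verify. The short check below also evaluates, for contrast, the CRP values $p(1) = 1$, $p(1,1) = \theta/(\theta+1)$, and $p(2) = 1/(\theta+1)$, which do satisfy the additivity requirement; the variable names are ours and the value of θ is arbitrary.

```python
from math import factorial, isclose

# Example 8: consistency would require p(1) = p(1,1) + p(2).
p8 = {(1,): 1.0, (1, 1): 0.1, (2,): 0.8}
example8_rhs = p8[(1, 1)] + p8[(2,)]            # 0.9, which differs from p(1) = 1

# Example 9: the first three terms of the total mass at N = 1 already exceed 1.
example9_mass = sum(0.9 / factorial(k) for k in range(3))   # 0.9 + 0.9 + 0.45

# Contrast: the CRP EPPF satisfies p(1) = p(1,1) + p(2) for any theta > 0.
theta = 2.5
crp_rhs = theta / (theta + 1) + 1 / (theta + 1)
```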

It therefore requires some care to define a suitable distribution over consis- 
tent, exchangeable random feature allocations or partitions using the exchange- 
able probability function framework. 

Since we are working with exchangeable sequences of random variables, it is
natural to turn to de Finetti's theorem [De Finetti, 1931] for clues as to how to
proceed. De Finetti's theorem tells us that any exchangeable sequence of random
variables can be expressed as an independent and identically distributed
sequence when conditioned on an underlying random mixing measure. While
this theorem may seem difficult to apply directly to, e.g., exchangeable
partitions, it may be applied more naturally to an exchangeable sequence of
numbers derived from a sequence of partitions. The argument below is due to
Aldous [1985].



Suppose that $(\Pi_n)$ is an exchangeable, consistent sequence of random
partitions. Consider the $k$th partition block to appear according to an order of
appearance labeling scheme, and give this block a new random label
$\phi_k \sim \mathrm{Unif}([0, 1])$ such that each random label is drawn independently
from the rest. Let $Z_n$ be the label of the block containing index $n$. This
construction is the same as the one used for parameter generation in Section 3.4,
and $(Z_n)$ is exchangeable by the same arguments used there.

If we apply de Finetti's theorem to the sequence $(Z_n)$ and note that $(Z_n)$ has
at most countably many different values, we see that there exists some random
sequence $(\rho_k)$ such that $\rho_k \in (0, 1]$ for all $k$ and, conditioned on the
frequencies $(\rho_k)$, $(Z_n)$ has the same distribution as iid draws from $(\rho_k)$.
In this description, we have brushed over technicalities associated with partition
blocks that contain only one index even as $N \to \infty$ (which may imply
$\sum_k \rho_k < 1$).

But if we assume that every partition block eventually contains at least
two indices, we can achieve an exchangeable partition of $[N]$ as follows. Let
$(\rho_k)$ represent a sequence of values in $(0, 1]$ such that
$\sum_{k=1}^\infty \rho_k = 1$. Draw
$Z_n \stackrel{iid}{\sim} \mathrm{Discrete}((\rho_k)_k)$. Let $\Pi_N$ be the induced
partition given $(Z_n)_{n=1}^N$. Exchangeability follows from the iid draws, and
consistency follows from the induced partition construction.
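A minimal sketch of this construction, assuming for simplicity a finite frequency vector `rho` that sums to one (the function name is hypothetical):

```python
import random


def paintbox_partition(rho, n, seed=0):
    """Draw Z_1..Z_n iid from Discrete(rho); return labels and induced blocks."""
    rng = random.Random(seed)
    labels = rng.choices(range(1, len(rho) + 1), weights=rho, k=n)
    blocks = {}
    for index, z in enumerate(labels, start=1):
        blocks.setdefault(z, []).append(index)
    return labels, sorted(blocks.values())
```

Because the labels are iid given the frequencies, permuting the indices does not change the distribution of the induced partition.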

When the frequencies $(\rho_k)$ are thought of as subintervals of the unit interval,
i.e. a partition of the unit interval, they are collectively called Kingman's
paintbox [Kingman, 1978]. As another naming convention, we may think of the unit
interval as a stick [Ishwaran and James, 2001]. We partition the unit interval
by breaking it into various stick lengths, which represent the frequencies of each
partition block.

Figure 3: An illustration of how stick-breaking divides the unit interval into
a sequence of probabilities. The stick proportions $(V_1, V_2, \ldots)$ determine what
fraction of the remaining stick is appended to the probability sequence at each
round.

A similar construction can be seen to yield exchangeable, consistent random
feature allocations. In this case, let $(\xi_k)$ represent a sequence of values in
$(0, 1]$ such that $\sum_{k=1}^\infty \xi_k < \infty$. We generate feature collections
independently for each index as follows. Start with $Y_n = \emptyset$. For each
feature $k$, add $k$ to the set $Y_n$, independently from all other features, with
probability $\xi_k$. Let $F_N$ be the induced feature allocation given
$(Y_n)_{n=1}^N$. Exchangeability of $F_N$ follows from the iid draws of $Y_n$, and
consistency follows from the induced feature allocation construction. The finite
sum constraint ensures each index belongs to a finite number of features a.s.
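The feature analogue can be sketched the same way, again assuming a finite frequency vector `xi` with entries in [0, 1] (a hypothetical helper, not a prescribed implementation):

```python
import random


def paintbox_features(xi, n, seed=0):
    """Each index includes feature k independently with probability xi[k-1]."""
    rng = random.Random(seed)
    ys = [{k for k, freq in enumerate(xi, start=1) if rng.random() < freq}
          for _ in range(n)]
    features = {}
    for index, y in enumerate(ys, start=1):
        for k in y:
            features.setdefault(k, []).append(index)
    return ys, features
```

Unlike the partition case, an index may land in several features or in none, since the frequencies need not sum to one.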

It remains to specify a distribution on the partition or feature frequencies.
The frequencies cannot be iid due to the finite summation constraint in either
case. In the partition case, any infinite set of frequencies cannot even
be independent since the summation is fixed to one. One scheme to ensure
summation to unity is called stick-breaking [McCloskey, 1965, Patil and Taillie,
1977, Sethuraman, 1994, Ishwaran and James, 2001]. In stick-breaking, the stick
lengths are obtained by recursively breaking off proportions of the unit interval
to return as the atoms $\rho_1, \rho_2, \ldots$ (cf. Figure 3). In particular, we
generate the stick-breaking proportions $V_1, V_2, \ldots$ as $[0, 1]$-valued random
variables. Then $\rho_1$ is the first proportion $V_1$ times the initial stick length 1;
hence $\rho_1 = V_1$. Recursively, after $k$ breaks, the remaining length of the
initial unit interval is $\prod_{j=1}^{k} (1 - V_j)$. And $\rho_{k+1}$ is the proportion
$V_{k+1}$ of the remaining stick; hence
$\rho_{k+1} = V_{k+1} \prod_{j=1}^{k} (1 - V_j)$.

The stick-breaking construction yields $\rho_1, \rho_2, \ldots$ such that
$\rho_k \in [0, 1]$ for each $k$ and $\sum_{k=1}^\infty \rho_k \le 1$. If the $V_k$ do not
decay too rapidly, we will have $\sum_{k=1}^\infty \rho_k = 1$ a.s. In particular, the
partition block proportions $\rho_k$ sum to unity a.s. iff there is no remaining stick
mass: $\prod_{k=1}^\infty (1 - V_k) = 0$ a.s.
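The recursion for the stick lengths is short enough to state as code. The sketch below (with a hypothetical function name) maps a finite list of proportions to stick lengths and also returns the leftover stick, so that lengths and leftover always sum to one.

```python
def stick_breaking(proportions):
    """Map V_1, V_2, ... to rho_k = V_k * prod_{j<k} (1 - V_j).

    Returns the stick lengths and the remaining (unbroken) stick mass.
    """
    remaining = 1.0
    lengths = []
    for v in proportions:
        lengths.append(v * remaining)
        remaining *= 1.0 - v
    return lengths, remaining
```

For example, proportions (0.5, 0.5, 0.5) give lengths (0.5, 0.25, 0.125) with 0.125 of the stick left over.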



Figure 4: An illustration of the Polya urn proof that Dirichlet process
stick-breaking gives the underlying partition block frequencies for a Chinese
restaurant process model. The $k$th column in the central matrix corresponds to a
tallying of when the $k$th table is chosen (gray), when a table of index larger
than $k$ is chosen (white), and when an index smaller than $k$ is chosen (×). If we
ignore the × tallies, the gray and white tallies can be modeled as balls drawn
from a Polya urn. The limiting frequency of gray balls in each column is shown
below the matrix.



We often make the additional, convenient assumption that the $V_k$ are
independent. In this case, a necessary and sufficient condition for
$\sum_{k=1}^\infty \rho_k = 1$ a.s. is $\sum_{k=1}^\infty E[\log(1 - V_k)] = -\infty$
[Ishwaran and James, 2001]. When the $V_k$ are independent and of a canonical
distribution, they are easily simulated. Moreover, if we assume that the $V_k$ are
such that the $\rho_k$ decay sufficiently rapidly in $k$, one strategy for simulating a
stick-breaking model is to ignore all $k > K$ for some fixed, finite $K$. This
approximation is known as truncation [Ishwaran and James, 2001]. It is
fortuitously the case that in some models of particular interest, such useful
assumptions fall out naturally from the model construction (e.g. Examples 10
and 11).

Example 10 (Chinese restaurant process). In the original result due to de
Finetti [De Finetti, 1931], the exchangeable random variables were
zero/one-valued and the mixing measure was a distribution on a single frequency
so that the outcomes were conditionally Bernoulli. We will find a similar result
in obtaining the stick-breaking proportions for the Chinese restaurant process
random partition model here.

We can construct a sequence of binary-valued random variables by dividing
the customers in the CRP who are sitting at the first table from the rest; color
the former collection of customers gray and the latter collection of customers
white. Then, we see that the first customer must be colored gray. And thus
we begin with a single gray customer and no white customers. This binary
valuation for the first table in the CRP is illustrated by the first column in
Figure 4.

At this point, it is useful to recall the Polya urn construction [Polya, 1930,
Freedman, 1965], whereby an urn starts with $G_0$ gray balls and $W_0$ white balls.
At each round $N$, we draw a ball from the urn, replace it, and add $\kappa$ of the
same color of ball to the urn. At the end of the round, we have $G_N$ gray
balls and $W_N$ white balls. Despite the urn metaphor, the number of balls
need not be an integer at any time. By checking the CRP Eq. (2) we can
see that the coloring of the gray/white customer matrix assignments starting
with the second customer has the same distribution as a sequence of balls
drawn from a Polya urn with $G_{1,0} = 1$ initial gray balls, $W_{1,0} = \theta$
initial white balls, and $\kappa_1 = 1$ replacement balls. Let $G_{1,N}$ and $W_{1,N}$
represent the numbers of gray and white balls, respectively, in the urn after $N$
rounds. The important fact about the Polya urn we use here is that there exists
some $V \sim \mathrm{Beta}(G_0/\kappa, W_0/\kappa)$ such that
$\kappa^{-1}(G_{N+1} - G_N) \stackrel{iid}{\sim} \mathrm{Bern}(V)$ for all $N$. In
this particular CRP case, then, $G_{1,N+1} - G_{1,N}$ is one if a customer sits at the
first table, and $G_{1,N+1} - G_{1,N} \stackrel{iid}{\sim} \mathrm{Bern}(V_1)$ with
$V_1 \sim \mathrm{Beta}(1, \theta)$.
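The Polya urn fact can be explored by simulation. The sketch below (hypothetical function name) runs an urn with possibly non-integer ball counts and returns the fraction of rounds on which gray was drawn; over many independent urns started at $G_0 = 1$, $W_0 = \theta$, $\kappa = 1$, these fractions should look approximately like draws from Beta(1, θ), whose first two moments are $1/(1+\theta)$ and $2/((1+\theta)(2+\theta))$.

```python
import random


def polya_urn_fraction(gray, white, kappa, rounds, rng):
    """Fraction of rounds in which a gray ball is drawn from a Polya urn."""
    g, w = float(gray), float(white)
    gray_draws = 0
    for _ in range(rounds):
        if rng.random() < g / (g + w):
            g += kappa            # replace the ball and add kappa gray balls
            gray_draws += 1
        else:
            w += kappa            # replace the ball and add kappa white balls
    return gray_draws / rounds
```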

We now look at the sequence of customers who sit at the second and subsequent
tables. That is, we condition on customers not sitting at the first table
or equivalently on the sequence with $G_{1,N+1} - G_{1,N} = 0$. Again, we have that
the first customer sits at the second table, by the CRP construction. Now let
customers at the second table be colored gray and customers at the third and
later tables be colored white. This valuation is illustrated in the second column
in Figure 4; each × in the figure denotes a data point where the first partition
block is chosen and therefore the current Polya urn is not in play. As before,
we begin with one gray customer and no white customers. We can check the
CRP Eq. (2) to see that customer coloring once more proceeds according to a
Polya urn scheme with $G_{2,0} = 1$ initial gray balls, $W_{2,0} = \theta$ initial white
balls, and $\kappa_2 = 1$ replacement balls. Thus, contingent on a customer not
sitting at the first table, the $N$th customer sits at the second table with iid
distribution $\mathrm{Bern}(V_2)$ with $V_2 \sim \mathrm{Beta}(1, \theta)$. And $V_2$ is
independent of $V_1$.

The argument just outlined proceeds recursively to show us that the $N$th
customer, conditional on not sitting at the first $K - 1$ tables for $K \ge 1$, sits
at the $K$th table with iid distribution $\mathrm{Bern}(V_K)$ and
$V_K \sim \mathrm{Beta}(1, \theta)$ with $V_K$ independent of the previous
$(V_1, \ldots, V_{K-1})$.

Combining these results, we see that we have the following construction
for the customer seating patterns. The $V_k$ are distributed independently and
identically according to $\mathrm{Beta}(1, \theta)$. The probability $\rho_K$ of sitting at
the $K$th table is the product of the probabilities of not sitting at each of the
first $K - 1$ tables, each conditional on not sitting at any earlier table, times the
conditional probability of sitting at the $K$th table:
$\rho_K = \left[ \prod_{k=1}^{K-1} (1 - V_k) \right] \cdot V_K$. Finally, with the vector
of table frequencies $(\rho_k)$, each customer sits independently and identically at
the corresponding vector of tables according to these frequencies. This process is
summarized here:

$$\begin{aligned}
V_k &\stackrel{iid}{\sim} \mathrm{Beta}(1, \theta) \\
\rho_K &:= V_K \prod_{k=1}^{K-1} (1 - V_k) \\
Z_n &\stackrel{iid}{\sim} \mathrm{Discrete}((\rho_k)_k).
\end{aligned} \tag{11}$$
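Eq. (11) can be simulated without fixing a truncation level by breaking new sticks only when a draw falls beyond those generated so far. The following is a sketch under that assumption; the function name is ours, and the returned labels are stick indices rather than order of appearance labels, although the induced partition is what matters.

```python
import random


def sample_labels_by_sticks(n, theta, seed=0):
    """Draw Z_1..Z_n from Eq. (11), generating sticks lazily as needed."""
    rng = random.Random(seed)
    sticks = []            # rho_1, rho_2, ... generated so far
    remaining = 1.0        # prod_j (1 - V_j) over sticks generated so far
    labels = []
    for _ in range(n):
        u = rng.random()
        k, cumulative = 0, 0.0
        while True:
            if k == len(sticks):
                v = rng.betavariate(1.0, theta)   # V_k ~ Beta(1, theta)
                sticks.append(v * remaining)
                remaining *= 1.0 - v
            cumulative += sticks[k]
            # The second condition guards against floating-point stalls.
            if u < cumulative or remaining == 0.0:
                break
            k += 1
        labels.append(k + 1)
    return labels, sticks
```

One quick check of the correspondence with the CRP is that the first two indices share a block with probability $1/(1+\theta)$.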
Figure 5: Illustration of the proof that the frequencies of features in the Indian
buffet process are given by beta random variables. For each feature, we can
construct a sequence of zero/one variables by tallying whether (gray, one) or
not (white, zero) that feature is represented by the given data point. Before the
first time a feature is chosen, we mark it with an ×. Each column sequence of
gray and white tallies, where we ignore the × marks, forms a Polya urn with
limiting frequencies shown below the matrix.



The feature case is easier. Since it does not require the frequencies to sum 
to one, the random frequencies can be independent so long as they have an a.s. 
finite sum. 

Example 11 (Indian buffet process). We use a similar urn approach to the
CRP case to recover stick lengths in the Indian buffet process.

Recall that on the first round of the Indian buffet process,
$K_1^+ \sim \mathrm{Pois}(\gamma)$ features are chosen to contain index 1. Consider one
of the features, labeled $k$. Each future data point $N$ belongs to this feature with
probability $N_{N-1,k}/(\theta + N - 1)$. Thus, we can model the sequence after the
first data point as a Polya urn of the sort encountered in Example 10 with
initially $G_{k,0} = 1$ gray balls, $W_{k,0} = \theta$ white balls, and $\kappa_k = 1$
replacement balls. As we have seen, there exists a random variable
$V_k \sim \mathrm{Beta}(1, \theta)$ such that representation of this feature by data point
$N$ is chosen, iid across all $N$, as $\mathrm{Bern}(V_k)$. Since the Bernoulli
draws conditional on previous draws are independent across all $k$, the $V_k$ are
likewise independent of each other; this fact is also true for $k$ in future rounds.
Draws according to such an urn are illustrated in each of the first four columns
of the matrix in Figure 5.

Now consider any round $n$. According to the IBP construction,
$K_n^+ \sim \mathrm{Pois}(\gamma\theta/(\theta + n - 1))$ new features are chosen to include
index $n$. Each future data point $N$ represents feature $k$ among these features
with probability $N_{N-1,k}/(\theta + N - 1)$. In this case, we can model the sequence
after the $n$th data point as a Polya urn with $G_{k,0} = 1$ initial gray balls,
$W_{k,0} = \theta + n - 1$ initial white balls, and $\kappa_k = 1$ replacement balls. So
there exists a random variable $V_k \sim \mathrm{Beta}(1, \theta + n - 1)$ such that
representation of feature $k$ by data point $N$ is chosen, iid across all $N$, as
$\mathrm{Bern}(V_k)$.

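Finally, then, we have the following generative model for the feature
allocation [Thibaux and Jordan, 2007]:

$$\begin{aligned}
K_n^+ &\stackrel{indep}{\sim} \mathrm{Pois}\left( \frac{\gamma\theta}{\theta + n - 1} \right) \\
K_n &= K_{n-1} + K_n^+ \\
V_k &\stackrel{indep}{\sim} \mathrm{Beta}(1, \theta + n - 1), \quad k = K_{n-1} + 1, \ldots, K_n \\
I_{n,k} &\stackrel{indep}{\sim} \mathrm{Bern}(V_k), \quad k = 1, \ldots, K_{n-1}
\end{aligned} \tag{12}$$

with $K_0 = 0$. $I_{n,k}$ is an indicator random variable for whether feature $k$
contains index $n$; for the $K_n^+$ features first chosen by index $n$, $I_{n,k} = 1$. The
collection of features to which index $n$ belongs, $Y_n$, is the collection of features
$k$ with $I_{n,k} = 1$. ■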
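The generative model in Eq. (12) can be simulated directly. In the sketch below the function names are hypothetical, Poisson draws use Knuth's multiplication method (fine for small rates), and index $n$ is taken to belong with probability one to the features it introduces, as in the buffet metaphor.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; suitable for small rates only."""
    limit = math.exp(-rate)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def ibp_by_sticks(n_indices, gamma, theta, seed=0):
    """Simulate Y_1..Y_N from the stick-length model of Eq. (12)."""
    rng = random.Random(seed)
    sticks = []                     # V_1, V_2, ... in order of appearance
    ys = []
    for n in range(1, n_indices + 1):
        # Existing features: I_{n,k} ~ Bern(V_k) for k <= K_{n-1}.
        y = {k for k, v in enumerate(sticks, start=1) if rng.random() < v}
        # New features: K_n^+ ~ Pois(gamma*theta/(theta+n-1)).
        for _ in range(sample_poisson(gamma * theta / (theta + n - 1), rng)):
            sticks.append(rng.betavariate(1.0, theta + n - 1))
            y.add(len(sticks))      # index n exhibits its own new features
        ys.append(y)
    return ys, sticks
```

Two simple checks on the output are that the expected number of features after $N$ indices is $\gamma\theta \sum_{n=1}^{N} (\theta + n - 1)^{-1}$ and that each index belongs to $\gamma$ features on average.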



4.1 Inference 

As we have seen above, the exchangeable probability functions of Section 3 are
the marginal distributions of the partitions or feature allocations generated
according to stick-length models with the stick lengths integrated out. It has been
proposed that including the stick lengths in MCMC samplers of these models
will improve mixing [Ishwaran and Zarepour, 2000]. While it is impossible to
sample the countably infinite set of partition block or feature frequencies in
these models (cf. Examples 10 and 11), a number of ways of getting around
this difficulty have been investigated. Ishwaran and Zarepour [2000] examine
two separate finite approximations to the full CRP stick length model; one uses
a parametric approximation to the full infinite model, and the other creates a
truncation by setting the stick break at some fixed size $K$ to be 1: $V_K = 1$.
However, retrospective sampling [Papaspiliopoulos and Roberts, 2008] and slice
sampling [Walker, 2007] can both be used to avoid any approximations and deal
instead directly with the full model.

While our inference discussion thus far has focused on MCMC sampling
as a means of approximating the posterior distribution of either the block
assignments or both the block assignments and stick lengths, including the stick
lengths in a posterior analysis facilitates a different posterior approximation;
in particular, variational methods can be used to approximate the posterior by
minimizing some sense of distance to the posterior over a family of potential
approximating distributions [Jordan et al., 1999]. The practicality and, indeed,
speed of these methods in the case of stick-breaking for the CRP (Example 10)
have been demonstrated by Blei and Jordan.

A number of different models for the stick lengths corresponding to the
features of an IBP (Example 11) have been discovered. The distributions
described in Example 11 are covered by Thibaux and Jordan [2007], who build on




work from 



A special case of the IBP is examined by 
who detail a slice sampling algorithm for sampling from the 
posterior of the stick lengths and feature assignments. Yet another stick length 
model for the IBP is explored by Paisley et al. [2010], who show how to apply
variational methods to approximate the posterior of their model.

Stick length modeling has the further advantage of allowing inference in cases 
where it is not straightforward to integrate out the underlying stick lengths to 
obtain a tractable exchangeable probability function. 



5 Subordinators 

An important point to reiterate about the labels $Z_n$ and label collections $Y_n$
is that when we use the order of appearance labeling scheme for partition or
feature blocks described above, the random sequences $(Z_n)$ and $(Y_n)$ are not
exchangeable. Often, however, we would like to make use of special properties
of exchangeability when dealing with these sequences. For instance, if we
use Markov Chain Monte Carlo to sample from the posterior distribution of a
partition (cf. Section 3.4), we might want to Gibbs sample $Y_n$ given
$\{Y_m\} \setminus \{Y_n\}$. This sampling is particularly easy in some cases
[Neal, 2000] if we can treat $Y_n$ as the last random variable in the sequence, but
this treatment requires exchangeability.

A way to get around this dilemma was suggested by Aldous [1985] and
appeared above in our motivation for using stick lengths. Namely, we assign to
the $k$th partition block a uniform random label $\phi_k \sim \mathrm{Unif}([0, 1])$;
analogously, we assign to the $k$th feature a uniform random label
$\phi_k \sim \mathrm{Unif}([0, 1])$. We can see that in both cases, all of the labels are
a.s. distinct. Now, in the partition case, let $Z_n$ be the uniform random label of
the partition block to which $n$ belongs. And in the feature case, let $Y_n$ be the
(finite) set of uniform random feature labels for the features to which $n$ belongs.
We can recover the partition or feature allocation as the induced partition or
feature allocation by grouping indices assigned to the same label. Moreover, as
discussed above, we now have that each of $(Z_n)$ and $(Y_n)$ is an exchangeable
sequence.

If we form partitions or features according to the stick length constructions
detailed in Section 4, we know that each unique partition or feature label $\phi_k$ is
associated with a frequency $\xi_k$. We can imagine this association in the form of
a random measure:

$$\mu = \sum_{k=1}^\infty \xi_k \delta_{\phi_k}. \tag{13}$$

In the partition case, $\sum_k \xi_k = 1$, so the random measure is a random
probability measure, and we may draw $Z_n \stackrel{iid}{\sim} \mu$. In the feature
case, the weights have a finite sum but do not necessarily sum to one. In the
feature case, we draw $Y_n$ by including each $\phi_k$ for which $\mathrm{Bern}(\xi_k)$
yields a draw of 1.

Another way to codify the random measure in Eq. (13) is as a monotone
increasing stochastic process on $[0, 1]$. Let

$$T_s = \sum_{k=1}^\infty \xi_k 1\{\phi_k \le s\}.$$
Figure 6: Left: The sample path $(T_s)$ of a subordinator. $T_{s-}$ is the left limit
of $(T_s)$ at $s$. Right: The right-continuous inverse $(S_t)$ of a subordinator.
The open intervals along the $t$ axis correspond to the jumps of the subordinator.



Then the atoms of $\mu$ are in one-to-one correspondence with the jumps of the
process $T$.
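The correspondence between atoms and jumps can be illustrated with a finite collection of atoms. In this sketch (hypothetical helper names) each atom is a pair `(xi, phi)` of weight and location, and the jump of the path at `s` is the total weight located exactly at `s`.

```python
def path_value(atoms, s):
    """T_s: total weight xi_k of atoms whose location phi_k is at most s."""
    return sum(xi for xi, phi in atoms if phi <= s)


def jump_at(atoms, s):
    """T_s minus its left limit at s: the weight located exactly at s."""
    return sum(xi for xi, phi in atoms if phi == s)
```

The path is non-decreasing and right-continuous by construction, since an atom at location `s` is already counted in `path_value(atoms, s)`.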

Taking this increasing random function approach gives us another means
of choosing distributions for the weights $\xi_k$. We have already seen that these
cannot be iid due to the finite summation condition, and in the partition case
they cannot even be a normalized version of iid random variables. However, we
will see that if we specify that the increments of a monotone, increasing
stochastic process are independent and stationary, then we can use the jumps of
that function as the atoms in our random measure for partitions or features.



Definition 12. A subordinator [Bochner, 1955, Bertoin, 1998, 2004] is a
stochastic process $(T_s, s \ge 0)$ that has

• Non-negative, non-decreasing paths (a.s.) 

• Cadlag paths (i.e., paths that are right-continuous with left limits) 

• Stationary, independent increments. 

For our purposes, wherein the subordinator values will ultimately correspond 
to (perhaps scaled) probabilities, we will assume the subordinator takes values 
in [0, oo) though alternative ranges with a sense of ordering are possible. 

Subordinators are of interest to us because not only do they exhibit the
stationary, independent increments property but they can always be decomposed
into two components: a deterministic drift component and a Poisson point
process [Bertoin, 1998]:



Theorem 13. Every subordinator $(T_s, s \ge 0)$ can be written as

$$T_s = cs + \sum_{k=1}^\infty \xi_k 1\{\phi_k \le s\}$$

for some constant $c \ge 0$ and where $\{(\xi_k, \phi_k)\}_k$ is the countable set of
points of a Poisson process with intensity $\Lambda(d\xi)\, d\phi$, where $\Lambda$ is a
Levy measure, i.e.

$$\int_0^\infty (1 \wedge \xi) \Lambda(d\xi) < \infty.$$

In particular, then, if a subordinator is finite at time $t$, the jumps of the
subordinator up to $t$ may be used as feature block frequencies if they have
support in $[0, 1]$. Or, in general, the normalized jumps may be used as partition
block frequencies. In either case, we have substituted the condition of
independent and identical distribution with a more natural continuous-time
analogue: independent, stationary increments.

Just as the Laplace transform of a positive random variable characterizes
the distribution of that random variable, so does the Laplace transform of the
subordinator (which is a positive random variable at any fixed time point)
describe this stochastic process [Bertoin, 1998, 2004].



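Theorem 14 (Lévy-Khinchin formula for subordinators). If $(T_s, s \ge 0)$ is a subordinator, then

\[ E\left(e^{-\lambda T_s}\right) = e^{-\Psi(\lambda) s} \tag{14} \]

with

\[ \Psi(\lambda) = c\lambda + \int_0^{\infty} \left(1 - e^{-\lambda \xi}\right) \Lambda(d\xi), \tag{15} \]

where $c \ge 0$ is called the drift constant and $\Lambda$ is a non-negative Lévy measure on $(0, \infty)$.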

The function $\Psi$ is called the Laplace exponent in this context. We
note that a subordinator is characterized by its drift constant and Lévy measure.
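As a rough numerical illustration of Eq. (15), the sketch below approximates the Laplace exponent by a midpoint rule for the Lévy measure $\theta \xi^{-1} e^{-b\xi}\, d\xi$ with zero drift and compares it with the closed form $\theta \log(1 + \lambda/b)$. The function names, the grid size, and the truncation point are illustrative assumptions, not part of the original development.

```python
import math


def levy_khinchin_exponent(lam, theta, b, c=0.0, n_grid=100_000, upper=60.0):
    """Approximate Psi(lam) = c*lam + int_0^inf (1 - exp(-lam*x)) Lambda(dx)
    for Lambda(dx) = theta * x**-1 * exp(-b*x) dx.

    Assumption: the integral is truncated at x = upper / b, where exp(-b*x)
    is negligible, and evaluated with a plain midpoint rule.
    """
    hi = upper / b
    h = hi / n_grid
    total = 0.0
    for i in range(n_grid):
        x = (i + 0.5) * h
        # -expm1(-lam*x) equals 1 - exp(-lam*x), computed stably for small x.
        total += -math.expm1(-lam * x) * theta * math.exp(-b * x) / x
    return c * lam + total * h


def gamma_exponent_closed_form(lam, theta, b):
    """Closed form theta * log(1 + lam / b); note that it vanishes at lam = 0."""
    return theta * math.log1p(lam / b)


if __name__ == "__main__":
    print(levy_khinchin_exponent(1.5, 2.0, 1.0))
    print(gamma_exponent_closed_form(1.5, 2.0, 1.0))
```

The two printed values should agree to several decimal places; the midpoint rule is only one convenient choice of quadrature.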

Using subordinators for feature allocation modeling is particularly easy; since 
the jumps of the subordinators are formed by a Poisson point process, we can 
use Poisson process methodology to find the stick lengths and EFPF that result 
when we choose each index's feature belonging according to Bernoulli draws at 
the jumps of the subordinator. 
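The Bernoulli construction just described can be sketched as follows. This is a minimal illustration that assumes a finite list of jump sizes in [0, 1] is already available (a real subordinator may have infinitely many jumps); the function name `feature_allocation_from_jumps` is hypothetical.

```python
import random


def feature_allocation_from_jumps(jumps, n_indices, rng):
    """Form a feature allocation of the indices 1..n_indices.

    Each index n joins the feature attached to jump xi_k independently with
    probability xi_k. Features that no index joins are dropped, and the
    remaining features are listed in order of first appearance.
    """
    membership = [[] for _ in jumps]
    for n in range(1, n_indices + 1):
        for k, xi in enumerate(jumps):
            if rng.random() < xi:  # Bernoulli(xi) draw at this jump
                membership[k].append(n)
    blocks = [block for block in membership if block]
    return sorted(blocks, key=lambda block: block[0])


if __name__ == "__main__":
    rng = random.Random(0)
    print(feature_allocation_from_jumps([0.9, 0.5, 0.1], 5, rng))
```

An index may appear in several blocks or in none, which is what distinguishes a feature allocation from a partition.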

Example 15 (Indian buffet process). So far, we have found a collection of stick lengths to represent the featural frequencies for the IBP (Eq. (12) of Example 11 in Section 4). To see the connection to subordinators, we start from the beta process subordinator [Kim, 1999] with zero drift ($c = 0$) and Lévy measure

\[ \Lambda(d\xi) = \gamma\theta\, \xi^{-1} (1 - \xi)^{\theta - 1}\, d\xi. \tag{16} \]

We will see that the mass parameter $\gamma > 0$ and concentration parameter $\theta > 0$ are the same as those introduced in Example 5 and continued in Example 11.

Figure 7: An illustration of Poisson thinning. The x-axis values of the filled black circles, emphasized by dotted lines, are generated according to a Poisson process. The [0, 1]-valued function $h(x)$ is arbitrary. The vertical axis values of the points are uniform draws in [0, 1]. The "thinned" points are the collection of x-axis values corresponding to vertical axis values below $h(x)$ and are denoted with a × symbol.



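Theorem 16. The size-biased jumps of the beta process subordinator with Lévy density given by Eq. (16) have the same distribution as the IBP stick lengths given by Eq. (12) in Example 11, namely:

\begin{align*}
K_n^{+} &\overset{indep}{\sim} \mathrm{Pois}\left(\frac{\gamma\theta}{\theta + n - 1}\right), \\
K_n &= \sum_{m=1}^{n} K_m^{+}, \\
V_k &\overset{indep}{\sim} \mathrm{Beta}(1, \theta + n - 1), \quad k = K_{n-1} + 1, \ldots, K_n, \\
I_{n,k} &\overset{indep}{\sim} \mathrm{Bern}(V_k), \quad k = 1, \ldots, K_n.
\end{align*}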



Proof. Recall the following fact about Poisson thinning [Kingman, 1993], illustrated in Figure 7. Suppose that a Poisson point process with rate measure $\lambda$ generates points with values $x$. Then suppose that, for each such point $x$, we keep it with probability $h(x) \in [0, 1]$. The resulting set of points is also a Poisson point process, now with rate measure $\lambda'(A) = \int_A h(x)\, \lambda(dx)$.
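The thinning fact can be checked by simulation. The sketch below assumes, purely for illustration, a homogeneous Poisson process on [0, 1] with a constant rate and the keep-probability $h(x) = x$ in the usage example; `sample_poisson` and `thin_poisson_process` are hypothetical helper names.

```python
import random


def sample_poisson(mean, rng):
    """Poisson(mean) count: number of unit-rate exponential arrivals in [0, mean]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def thin_poisson_process(rate, h, rng):
    """Draw a homogeneous Poisson process on [0, 1] and thin it by h.

    Returns (points, kept): every point x is kept independently with
    probability h(x), so kept is a Poisson process with rate measure
    rate * h(x) dx.
    """
    n_points = sample_poisson(rate, rng)
    points = [rng.random() for _ in range(n_points)]
    kept = [x for x in points if rng.random() < h(x)]
    return points, kept


if __name__ == "__main__":
    rng = random.Random(0)
    points, kept = thin_poisson_process(50.0, lambda x: x, rng)
    # The expected number of kept points is 50 * int_0^1 x dx = 25.
    print(len(points), len(kept))
```

Averaging the number of kept points over many replications should approach the total mass of the thinned rate measure.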

Consider again the proposal to generate feature membership from a subor- 
dinator by taking Bernoulli draws at each of its jumps with success probability 
equal to the jump size. Since every jump has strictly positive size, the feature 
associated with each jump will eventually score a Bernoulli success for some 
index n with probability one. Therefore, we can enumerate all jumps of the 
process by first enumerating all features in which index 1 appears, then all fea- 
tures in which index 2 appears but not index 1, and so on; at the nth iteration, 
we enumerate all features in which index n appears but not previous indices. 

We prove Theorem 16 recursively. Define the measure

\[ \mu_n(d\xi) := \gamma\theta\, \xi^{-1} (1 - \xi)^{\theta + n - 1}\, d\xi, \]

so that $\mu_0$ is the beta process Lévy measure $\Lambda$ in Eq. (16). We make the recursive assumption that $\mu_n$ is the rate measure of the beta process atoms that remain once the atoms corresponding to features chosen on the first $n$ iterations are removed.

There are two parts to proving Theorem 16. First, we show that, on the $n$th iteration, the number of features chosen and the distribution of the corresponding atom weights agree with Eq. (12). Second, we check that the recursion assumption holds.

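For the first part, note that on the $n$th round we choose features with probability equal to their atom weight. So we form a thinned Poisson process with rate measure $\xi\, \mu_{n-1}(d\xi) = \gamma\theta (1 - \xi)^{\theta + n - 2}\, d\xi$. This rate measure has total mass

\[ \int_0^1 \xi\, \mu_{n-1}(d\xi) = \frac{\gamma\theta}{\theta + n - 1}. \]

So the number of features chosen is Poisson-distributed with mean $\gamma\theta (\theta + n - 1)^{-1}$, as desired. And the atom weights have distribution equal to the normalized rate measure

\[ (\theta + n - 1)(1 - \xi)^{\theta + n - 2}\, d\xi = \mathrm{Beta}(\xi \mid 1, \theta + n - 1)\, d\xi, \]

as desired.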

Finally, to check the recursion assumption, we note that those sticks that remained were chosen for having Bernoulli failure draws; i.e., they were chosen with probability equal to one minus their atom weight. So the thinned rate measure for the next round is

\[ (1 - \xi)\, \mu_{n-1}(d\xi) = \gamma\theta\, \xi^{-1} (1 - \xi)^{\theta + n - 1}\, d\xi, \]

which is just $\mu_n$. □

The form of the EFPF of the feature allocation generated from the beta
process subordinator follows immediately from the stick length distributions we
have just derived by the discussion in Example 11 in Section 4. ■
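The stick-length recursion in Theorem 16 translates directly into a sampler. The following is a minimal sketch under the stated distributions (a Poisson number of new features at round $n$, each with a Beta(1, θ + n − 1) weight); the helper names are hypothetical and the Poisson sampler is a simple stdlib stand-in.

```python
import random


def sample_poisson(mean, rng):
    """Poisson(mean) count: number of unit-rate exponential arrivals in [0, mean]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def sample_ibp_sticks(gamma, theta, n_indices, rng):
    """Sample IBP stick lengths in order of appearance.

    Returns (sticks, first_index): sticks[k] is the weight V_k and
    first_index[k] is the round n at which feature k first appears.
    """
    sticks, first_index = [], []
    for n in range(1, n_indices + 1):
        # Number of new features at round n: Pois(gamma * theta / (theta + n - 1)).
        k_new = sample_poisson(gamma * theta / (theta + n - 1), rng)
        for _ in range(k_new):
            # Size-biased weight of a feature first chosen at round n.
            sticks.append(rng.betavariate(1.0, theta + n - 1))
            first_index.append(n)
    return sticks, first_index


if __name__ == "__main__":
    rng = random.Random(0)
    sticks, first_index = sample_ibp_sticks(3.0, 2.0, 10, rng)
    print(len(sticks), first_index)
```

The expected total number of features after $N$ rounds is $\gamma\theta \sum_{n=1}^{N} (\theta + n - 1)^{-1}$, which a Monte Carlo average of `len(sticks)` should approach.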

We see from the previous example that feature allocation stick lengths and 
EFPFs can be obtained in a straightforward manner using the Poisson process 
representation of the jumps of the subordinator. Partitions, however, are not as 
easy to analyze, principally due to the fact that the subordinator jumps must 
first be normalized to obtain a probability measure on [0, 1], rather than just a 
random measure with finite total mass. We must compute the stick lengths and 
EPPF using partition block frequencies from these normalized jumps instead of 
the direct subordinator jumps. 

In the EPPF case, we make use of a result that gives us the exchangeable probability function as a function of the Laplace exponent. Though we do not derive this formula here, its derivation can be found in Pitman [2003]: the proof relies on, first, calculating the joint distribution of the subordinator jumps and partition generated from the normalized jumps and, second, integrating out the subordinator jumps to find the partition marginal.

Theorem 17. Let $(\Pi_N)$ be a consistent set of exchangeable partitions generated from the normalized jumps of a subordinator with Laplace exponent $\Psi$. For each exchangeable partition $\pi_N = \{A_1, \ldots, A_K\}$ of $[N]$ with $N_k := |A_k|$ for each $k$,

\[ P(\Pi_N = \{A_1, \ldots, A_K\}) = p(N_1, \ldots, N_K) = \frac{(-1)^{N-K}}{(N-1)!} \int_0^{\infty} \lambda^{N-1} e^{-\Psi(\lambda)} \prod_{k=1}^{K} \Psi^{(N_k)}(\lambda)\, d\lambda, \tag{17} \]

where $\Psi^{(N_k)}$ denotes the derivative of $\Psi$ of order $N_k$.



Example 18 (Chinese restaurant process). We start by introducing the gamma process, a subordinator which we will see below generates the Chinese restaurant process EPPF. The gamma process has Laplace exponent $\Psi(\lambda)$ (Eq. (15)) characterized by

\[ c = 0, \quad \text{and} \quad \Lambda(d\xi) = \theta\, \xi^{-1} e^{-b\xi}\, d\xi \tag{18} \]

for $\theta > 0$ and $b > 0$ (cf. Eq. (15) in Theorem 14). We will see that $\theta$ corresponds to the CRP concentration parameter and that $b$ is arbitrary and does not affect the partition model.

We calculate the EPPF using Theorem 17.

Theorem 19. The EPPF for partition block membership chosen according to the normalized jumps $(p_k)$ of the gamma subordinator with parameter $\theta$ is the CRP EPPF (Eq. (3)).

Proof. By Theorem 17, if we can find all order derivatives of $\Psi$, we can calculate the EPPF for the partitions generated with frequencies equal to the normalized jumps of this subordinator. The derivatives of $\Psi$, which are known to always exist [Bertoin, 2000, Rogers and Williams, 2000], are straightforward to calculate if we begin by noting that, from Eq. (15) in Theorem 14, we have

\[ \Psi'(\lambda) = c + \int_0^{\infty} \xi e^{-\lambda \xi}\, \Lambda(d\xi). \]

Hence,

\[ \Psi'(\lambda) = \int_0^{\infty} \xi e^{-\lambda \xi}\, \theta\, \xi^{-1} e^{-b\xi}\, d\xi = \frac{\theta}{\lambda + b}. \]

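Then simple integration (using $\Psi(0) = 0$) and differentiation yield

\[ \Psi(\lambda) = \theta \log(\lambda + b) - \theta \log b, \]
\[ \Psi^{(n)}(\lambda) = (-1)^{n-1} \frac{\theta\, (n-1)!}{(\lambda + b)^{n}}. \]

We can substitute these quantities into the general EPPF formula in Eq. (17) of Theorem 17 to obtain

\begin{align*}
p(N_1, \ldots, N_K)
&= \frac{(-1)^{N-K}}{(N-1)!} \int_0^{\infty} \lambda^{N-1} \left(\frac{b}{\lambda + b}\right)^{\theta} \prod_{k=1}^{K} (-1)^{N_k - 1} \frac{\theta\, (N_k - 1)!}{(\lambda + b)^{N_k}}\, d\lambda \\
&= \frac{\theta^{K}}{(N-1)!} \left[\prod_{k=1}^{K} (N_k - 1)!\right] \int_0^{\infty} \frac{b^{\theta} \lambda^{N-1}}{(\lambda + b)^{N + \theta}}\, d\lambda \\
&= \frac{\theta^{K}}{(N-1)!} \left[\prod_{k=1}^{K} (N_k - 1)!\right] \frac{\Gamma(N)\Gamma(\theta)}{\Gamma(N + \theta)} \\
&= \theta^{K} \left[\prod_{k=1}^{K} (N_k - 1)!\right] \frac{1}{\theta (\theta + 1)_{N-1 \uparrow 1}},
\end{align*}

where $(\theta + 1)_{N-1 \uparrow 1} := (\theta + 1)(\theta + 2) \cdots (\theta + N - 1)$.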



The penultimate line follows from the form of the beta prime distribution, and
the last line is the CRP EPPF from Eq. (3), as desired. We note in particular
that the parameter $b$ does not appear in the final EPPF. □
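As a quick sanity check on the closed form, the sketch below evaluates the CRP EPPF $\theta^K \prod_k (N_k - 1)! \,/\, [\theta(\theta+1)\cdots(\theta+N-1)]$ and confirms that it sums to one over all partitions of a three-element set. The function name `crp_eppf` is an illustrative choice.

```python
from math import factorial, prod


def crp_eppf(block_sizes, theta):
    """CRP probability of one particular partition with the given block sizes.

    Computes theta^K * prod_k (N_k - 1)! / (theta * (theta + 1) * ... * (theta + N - 1)).
    """
    n_total = sum(block_sizes)
    n_blocks = len(block_sizes)
    numerator = theta ** n_blocks * prod(factorial(m - 1) for m in block_sizes)
    denominator = prod(theta + i for i in range(n_total))
    return numerator / denominator


if __name__ == "__main__":
    theta = 1.7
    # Partitions of {1, 2, 3}: one with sizes (3,), three with sizes (2, 1),
    # and one with sizes (1, 1, 1).
    total = crp_eppf((3,), theta) + 3 * crp_eppf((2, 1), theta) + crp_eppf((1, 1, 1), theta)
    print(total)
```

Because the EPPF is symmetric in the block sizes, each set of sizes needs to be evaluated only once and weighted by the number of partitions with those sizes.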



Whenever the Laplace exponent of a subordinator is known, Theorem 17
can similarly be applied to quickly find the EPPF of the partition generated by 
sampling from the normalized subordinator jumps. 

To find the stick lengths, i.e. partition block frequency distributions, from 
the subordinator representation for a partition, we must find the distributions 
of the normalized subordinator jumps. 

Example 20 (Chinese restaurant process). We continue with the CRP example. 

Theorem 21. The size-biased, normalized jumps $(p_k)$, i.e., jumps in order of appearance, of the gamma subordinator with concentration parameter $\theta$ (and arbitrary parameter $b > 0$) have the same distribution as the CRP stick lengths (Eq. (11) of Example 10 in Section 4):

\[ p_k = V_k \prod_{j=1}^{k-1} (1 - V_j) \quad \text{for} \quad V_j \overset{iid}{\sim} \mathrm{Beta}(1, \theta). \]



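Proof. First, we introduce some notation. Let $\tau = \sum_k \xi_k$, the sum over all of the jumps of the subordinator. Second, let $\tau_k = \tau - \sum_{j=1}^{k} \xi_j$, the total sum minus the first $k$ elements. Finally, let $W_k = \tau_k / \tau_{k-1}$ and $V_k = 1 - W_k$. Then a simple telescoping of factors shows that $p_k = V_k \prod_{j=1}^{k-1} (1 - V_j)$:

\[ V_k \prod_{j=1}^{k-1} (1 - V_j) = \frac{\tau_{k-1} - \tau_k}{\tau_{k-1}} \prod_{j=1}^{k-1} \frac{\tau_j}{\tau_{j-1}} = \frac{\xi_k}{\tau_{k-1}} \cdot \frac{\tau_{k-1}}{\tau_0} = \frac{\xi_k}{\tau} = p_k. \]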



It remains to show that the $V_k$ have the desired distribution. To that end, it is easier to work with the $W_k$. We will find the following lemma [Pitman, 2006] useful.

Lemma 22. Let $\rho$ be the density of $\Lambda$ with respect to Lebesgue measure. And let $f$ be the density of the distribution of $\tau$ with respect to Lebesgue measure. Then

\[ P(\tau_0 \in dt_0, \ldots, \tau_k \in dt_k) = f(t_k)\, dt_k \prod_{j=0}^{k-1} \frac{(t_j - t_{j+1})\, \rho(t_j - t_{j+1})}{t_j}\, dt_j. \]

With this lemma in hand, the result follows from a change of variables calculation; we use a bijection between $\{W_1, \ldots, W_k, \tau\}$ and $\{\tau_0, \ldots, \tau_k\}$ defined by $\tau_k = \tau \prod_{j=1}^{k} W_j$. The determinant of the Jacobian for the transformation to the latter from the former is

\[ J = \prod_{j=0}^{k-1} t_j, \]

where we note that $t_k$ is a function of $\tau$ and $\{W_1, \ldots, W_k\}$. Then, since the factor $J$ cancels the denominators $t_j$ in the lemma,

\begin{align*}
P(W_1 \in dw_1, \ldots, W_k \in dw_k, \tau \in dt_0)
&= P(\tau_0 \in dt_0, \ldots, \tau_k \in dt_k) \cdot J \\
&= \left[\prod_{j=0}^{k-1} (t_j - t_{j+1})\, \rho(t_j - t_{j+1})\right] f(t_k)\, dw_1 \cdots dw_k\, dt_0.
\end{align*}

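In the case of the gamma process, we can read $\rho(\xi) = \theta\, \xi^{-1} e^{-b\xi}$ from Eq. (18). The function $f$ is determined by $\rho$ and in this case [Pitman, 2006]:

\[ f(t) = \mathrm{Ga}(t \mid \theta, b) = b^{\theta}\, \Gamma(\theta)^{-1}\, t^{\theta - 1} e^{-bt}. \]

So, using $t_k = t_0 \prod_{j=1}^{k} w_j$,

\begin{align*}
P(W_1 \in dw_1, \ldots, W_k \in dw_k, \tau \in dt_0)
&= \theta^{k} e^{-b(t_0 - t_k)} \cdot b^{\theta}\, \Gamma(\theta)^{-1}\, t_k^{\theta - 1} e^{-b t_k}\, dw_1 \cdots dw_k\, dt_0 \\
&= \left[b^{\theta}\, \Gamma(\theta)^{-1}\, t_0^{\theta - 1} e^{-b t_0}\, dt_0\right] \prod_{j=1}^{k} \theta\, w_j^{\theta - 1}\, dw_j.
\end{align*}

Since the distribution factorizes, the $(W_k)$ are independent of each other and of $\tau$. Second, we can read off the distributional kernel of each $W_k$ to establish $W_k \sim \mathrm{Beta}(\theta, 1)$, whence it follows that $V_k \sim \mathrm{Beta}(1, \theta)$. □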
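The telescoping identity at the start of the proof is easy to verify numerically. The sketch below assumes, for illustration only, a finite list of jump sizes given in order of appearance (a subordinator generally has infinitely many jumps); the function names are hypothetical.

```python
def sticks_from_jumps(jumps):
    """Return (p, v) for jump sizes listed in order of appearance.

    p[k] is the normalized jump xi_k / tau, and v[k] = 1 - tau_k / tau_{k-1},
    where tau_k is the total mass remaining after the first k jumps.
    """
    tau = sum(jumps)
    remaining = tau
    p, v = [], []
    for xi in jumps:
        next_remaining = remaining - xi
        v.append(1.0 - next_remaining / remaining)
        p.append(xi / tau)
        remaining = next_remaining
    return p, v


def stick_breaking(v):
    """Map stick proportions V_k to lengths V_k * prod_{j<k} (1 - V_j)."""
    lengths, leftover = [], 1.0
    for vk in v:
        lengths.append(vk * leftover)
        leftover *= 1.0 - vk
    return lengths


if __name__ == "__main__":
    p, v = sticks_from_jumps([0.5, 1.25, 0.25, 2.0])
    print(p)
    print(stick_breaking(v))  # agrees with p up to floating-point error
```

With a finite list the last proportion equals one, since no mass remains after the final jump.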



5.1 Inference 

In some sense, we skipped ahead in describing inference in Sections 3.4 and 4.1. There, we made use of the fact that random labels for partitions and features imply exchangeability of the data partition block assignments $(Z_n)$ and data feature assignments $(Y_n)$. In the discussion above, we study the object that associates random uniformly distributed labels with each partition or feature. Assuming the labels come from a uniform distribution rather than a general continuous distribution is a special case of the general labeling scheme, and we defer the general case to the next section (Section 6).

We have seen above that it is particularly straightforward to obtain an EPPF or EFPF formulation, which yields Gibbs sampling steps as described in Section 3.4, when the stick lengths are generated according to a normalized Poisson process in the partition case or a Poisson process in the feature case. Examples 15 and 18 illustrate how to find such exchangeable probability functions. Further, as we have seen the usefulness of the stick representation in inference, Examples 15 and 20 illustrate how stick length distributions may be recovered from the subordinator framework.

6 Completely random measures 

In the discussion of subordinators above, the jump sizes of the subordinator correspond to the feature frequencies or unnormalized partition frequencies and are the quantities of interest; the locations of the jumps are convenient labels that allow the sequence of index assignments ($Z_1, Z_2, \ldots$ in the clustering case or $Y_1, Y_2, \ldots$ in the feature case) to be exchangeable.

However, this labeling retains the same convenient properties as long as the 
labels are chosen iid from any continuous distribution (not just the uniform 
distribution), thereby guaranteeing that each partition block or feature has a 
unique label a.s. Moreover, in typical applications, we wish to associate some 
parameter with each partition block or feature. In the partition case, we typically model the observed data $X_n$ indexed by $n$ as being generated according to some likelihood depending on the parameter corresponding to its partition block assignment. Likewise, in the feature case, we typically model the observed data $X_n$ indexed by $n$ as being generated according to some likelihood depending on the collection of parameters corresponding to its collection of feature block assignments (cf. Eq. (10)).

In these cases, it can be useful to suppose that the partition block labels, or feature labels, $\phi_k$ are not necessarily $\mathbb{R}_+$-valued but rather are generated iid according to some continuous distribution $H$ on a general space $\Phi$. Then, whenever $k$ is the order-of-appearance partition block label of index $n$, we let $Z_n = \phi_k$. Similarly, whenever $k$ is the order-of-appearance feature label for some feature to which index $n$ belongs, $\phi_k \in Y_n$. Finally, then, we complete the generative model in the partition case by letting $X_n \overset{indep}{\sim} F(Z_n)$ for some distribution function $F$ depending on parameter $Z_n$. And in the feature case, $X_n \overset{indep}{\sim} F(Y_n)$, where now the distribution function $F$ depends on the collection of parameters $Y_n$.

When we take the jump sizes $(\xi_k)$ of a subordinator as the weights of atoms with locations $(\phi_k)$ drawn iid according to $H$ as described above, we find ourselves with a completely random measure $\mu$:

\[ \mu = \sum_{k=1}^{\infty} \xi_k \delta_{\phi_k}. \tag{19} \]

A completely random measure is a random measure $\mu$ such that whenever $A$ and $A'$ are disjoint sets, we have that $\mu(A)$ and $\mu(A')$ are independent random variables.

To see that changing the atom locations from a subordinator in this way yields a completely random measure, note that Theorem 13 tells us that the subordinator jump sizes are generated according to a Poisson point process, with some intensity measure $\nu(d\xi)$. The Marking Theorem for Poisson processes [Kingman, 1993] in turn yields that the tuples $\{(\xi_k, \phi_k)\}_k$ are generated according to a Poisson point process with intensity measure $\nu(d\xi) H(d\phi)$. By Kingman [1967], whenever the tuples $\{(\xi_k, \phi_k)\}_k$ are drawn according to a Poisson point process, the measure in Eq. (19) is completely random.

Example 23 (Dirichlet process). We can form a completely random measure from the gamma process subordinator and a random labeling of the partition blocks. Specifically, suppose that the labels come from a continuous measure $H$. Then we generate a completely random measure $G$ called a gamma process [Ferguson, 1973] in the following way:

\[ \nu(d\xi \times d\phi) = \theta\, \xi^{-1} e^{-b\xi}\, d\xi \cdot H(d\phi) \tag{20} \]
\[ \{(\xi_k, \phi_k)\}_k \sim \mathrm{PPP}(\nu) \tag{21} \]
\[ G = \sum_{k=1}^{\infty} \xi_k \delta_{\phi_k}. \tag{22} \]

Here, $\mathrm{PPP}(\nu)$ denotes a draw from a Poisson point process with intensity measure $\nu$. The parameters $\theta > 0$ and $b > 0$ are the same as for the gamma process subordinator. A gamma process draw, along with its generating Poisson point process intensity measure, is illustrated in Figure 8.

The Dirichlet process (DP) is the random measure formed by normalizing the gamma process [Ferguson, 1973]. Since the Dirichlet process atom weights sum to one, it cannot be completely random. We can write the Dirichlet process $D$ generated from the gamma process $G$ above as:

\[ \tau = \sum_{k=1}^{\infty} \xi_k, \qquad p_k = \frac{\xi_k}{\tau}, \qquad D = \sum_{k=1}^{\infty} p_k \delta_{\phi_k}. \]

The random variables $p_k$ have the same distribution as the Dirichlet process sticks (Eq. (11)) or normalized gamma process subordinator jump lengths, as we have seen above (Example 20). ■



Consider sampling points from a Dirichlet process and forming the induced
partition of the data indices. Theorem 19 shows us that the distribution of the
induced partition is the Chinese restaurant process EPPF.
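The sampling scheme in the preceding paragraph can be sketched in code. This is only an approximation under stated assumptions: the Dirichlet process draw is truncated to a finite number of atoms via the Beta(1, θ) stick lengths, the leftover mass is assigned to the last atom, and the labels are taken uniform on [0, 1]. The function names `truncated_dp_draw` and `induced_partition` are hypothetical.

```python
import random


def truncated_dp_draw(theta, n_atoms, rng):
    """Approximate Dirichlet process draw with n_atoms atoms.

    Weights come from stick-breaking with Beta(1, theta) proportions; the
    mass left after n_atoms - 1 breaks is placed on the last atom (a
    truncation assumption). Labels are iid uniform on [0, 1].
    """
    weights, leftover = [], 1.0
    for _ in range(n_atoms - 1):
        v = rng.betavariate(1.0, theta)
        weights.append(v * leftover)
        leftover *= 1.0 - v
    weights.append(leftover)
    labels = [rng.random() for _ in range(n_atoms)]
    return weights, labels


def induced_partition(weights, labels, n_points, rng):
    """Sample labels Z_1..Z_n from the atoms and group indices sharing a label."""
    draws = rng.choices(labels, weights=weights, k=n_points)
    blocks = {}
    for n, z in enumerate(draws, start=1):
        blocks.setdefault(z, []).append(n)
    # List blocks in order of appearance (by smallest index in each block).
    return sorted(blocks.values(), key=lambda block: block[0])


if __name__ == "__main__":
    rng = random.Random(1)
    weights, labels = truncated_dp_draw(2.0, 50, rng)
    print(induced_partition(weights, labels, 20, rng))
```

Every index lands in exactly one block, so the output is a partition of the indices; with a large enough truncation level its distribution should be close to the CRP.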










Figure 8: The gray manifold depicts the Poisson point process intensity measure $\nu$ in Eq. (20) for the choice $\Phi = [0, 1]$ and $H$ the uniform distribution on [0, 1]. The endpoints of the line segments are points drawn from the Poisson point process as in Eq. (21). Taking the first coordinate, which is positive real-valued, as the atom weights, we find the measure $G$ on $\Phi$ from Eq. (22) in the bottom plane.

Example 24 (Beta process). We can form a completely random measure from the beta process subordinator and a random labeling of the feature blocks. If the labels are generated iid from a continuous measure $H$, then we say the completely random measure $B$, generated as follows, is a beta process:

\[ \nu(d\xi \times d\phi) = \gamma\theta\, \xi^{-1} (1 - \xi)^{\theta - 1}\, d\xi \cdot H(d\phi) \tag{23} \]
\[ \{(\xi_k, \phi_k)\}_k \sim \mathrm{PPP}(\nu) \tag{24} \]
\[ B = \sum_{k=1}^{\infty} \xi_k \delta_{\phi_k}. \tag{25} \]

The beta process, along with its generating intensity measure, is depicted in Figure 9. Then the $(\xi_k)$ have the same distribution as the beta process sticks (Eq. (12)) or the beta process subordinator jump lengths (Example 15). ■

Now consider sampling a collection of atom locations according to Bernoulli draws from the atom weights of a beta process and forming the induced feature allocation of the data indices. Theorem 16 shows us that the distribution of the induced feature allocation is given by the Indian buffet process EFPF.

6.1 Inference

In this section, we finally study the full model first outlined in the context of inference of partition and feature structures in Section 3.4. The partition or feature labels described in this section are the same as the block-specific parameters first described in Section 3.4. Since this section focuses on a generalization of the partition or feature labeling scheme beyond the uniform distribution option encoded in subordinators, inference for the atom weights remains unchanged from Sections 3.4, 4.1, and 5.1.

Figure 9: The gray manifold depicts the Poisson point process intensity measure $\nu$ in Eq. (23) for the choice $\Phi = [0, 1]$ and $H$ the uniform distribution on [0, 1]. The endpoints of the line segments are points drawn from the Poisson point process as in Eq. (24). Taking the first coordinate, which is [0, 1]-valued, as the atom weights, we find the measure $B$ on $\Phi$ from Eq. (25) in the bottom plane.

However, we note that, in the course of inferring underlying partition or 
feature structures, we are often also interested in inferring the data likelihood 
parameters that form the partition or feature labels and govern the data dis- 
tribution within each block. Conditional on the partition or feature structure, 
such inference is handled as in a normal hierarchical model with fixed depen- 
dencies. Namely, the parameter within a particular block may be inferred from 
the data points that depend on this block as well as the prior distribution for 
the parameters. Details for the Dirichlet process example inferred via MCMC sampling are provided by MacEachern [1994], Escobar and West [1995], Neal [2000]; Blei and Jordan [2006] work out details for the Dirichlet process using variational methods. In the beta process case, Griffiths and Ghahramani [2006], Teh et al. [2007], Thibaux and Jordan [2007] describe MCMC sampling, and Paisley et al. [2010] describe a variational methods approach.



7 Conclusion 

In the discussion above, we have pursued a sequential augmentation from (1) 
simple distributions over partitions and feature allocations in the form of ex- 
changeable probability functions to (2) the representation of stick lengths en- 
coding frequencies of the partition block and feature occurrences to (3) subordinators, which associate random $\mathbb{R}_+$-valued labels with each partition block
or feature, and finally to (4) completely random measures, which associate a 
general class of labels with the stick lengths and whose labels we generally use 
as parameters in likelihood models built from the partition or feature allocation 
representation. 

Along the way, we have focused primarily on two vignettes. We have shown, 
via these successive augmentations, that the Chinese restaurant process specifies 
the marginal distribution of the induced partition formed from iid draws from 
a Dirichlet process, which is in turn a normalized completely random measure. 
And we have shown that the Indian buffet process specifies the marginal dis- 
tribution of the induced feature allocation formed by iid Bernoulli draws across 
the weights of a beta process. 

There are many extensions of these ideas that lie beyond the scope of this paper. A number of extensions of the CRP and Dirichlet process exist, in either the EPPF form [Pitman, 1996, Blei and Frazier, 2010], the stick length form [Dunson and Park, 2008], or the random measure form [Pitman and Yor, 1997]. Likewise, extensions of the IBP and beta process have been explored [Teh et al., 2007, Paisley et al., 2010, Broderick et al., 2012].



More generally, the framework above demonstrates how alternative partition and feature allocation models may be constructed: either by introducing different EPPFs [Pitman, 1996, Gnedin and Pitman, 2006] or EFPFs, different stick length distributions [Ishwaran and James, 2001], or different random measures [Wolpert and Ickstadt, 2004].

Finally, we note that expanding the set of combinatorial structures with useful Bayesian priors from partitions to the superset of feature allocations suggests that further such structures might be usefully examined. For instance, the beta negative binomial process [Broderick et al., 2011] provides a prior on a generalization of a feature allocation where we allow the features themselves to be multisets; i.e., each index may have non-negative integer multiplicities of features. Models on trees [Adams et al., 2010, McCullagh et al., Blei et al., 2010], graphs [Li and McCallum, 2006], and permutations [Pitman, 1996] provide avenues for future exploration. And there likely remain further structures to be fitted out with useful Bayesian priors.